A compact microelectronic device for performing modular multiplication and exponentiation over large numbers.

A compact synchronous microelectronic peripheral machine for standard microprocessors, with means for proper clocking and control, has as essential elements: three main subdivided, switched and clocked shift registers, B, S, and N; only two multiplexed serial/parallel multipliers; borrow detectors, ancillary subtracters and adders; delay registers and switching elements; all of which embody a totally integrated concurrent and synchronous process approach to modular multiplication, squaring, and exponentiation. A method for carrying out modular multiplication, wherein the multiplicand A, the multiplier B and the modulus N comprise m characters of k bits each, the multiplier not being greater than the modulus, is also described, wherein the multiplicand can be much larger than the modulus. It is demonstrated how the device can be used as a large number processor in the normal field of numbers.

The present invention relates to modular processing of large numbers in the Galois field of prime numbers and also of composite prime moduli. More specifically, the invention relates to a device to implement modular multiplications/exponentiations of large numbers, which is suitable for performing the operations essential to Public Key Cryptographic authentication and encryption protocols, which cannot be executed with small microprocessors in reasonable processing time.

The present invention relates to the hardware implementation of a procedure known as "the interleaved Montgomery multiprecision modular multiplication method", often used in encryption software oriented systems. A unique original method is provided to accelerate modular exponentiation; and vital proofs are used to simplify the architecture and extend the use of the device to large number calculations in the normal field of numbers.

The basic process is one of the three published related methods for performing modular multiplication with Montgomery's methodology [P. L. Montgomery, "Modular multiplication without trial division", Mathematics of Computation, vol. 44, pp. 519-521, 1985], hereinafter referred to as "Montgomery"; [S. R. Dusse and B. S. Kaliski Jr., "A cryptographic library for the Motorola DSP 56000", Proc. Eurocrypt '90, Springer-Verlag, Berlin, 1990], hereinafter referred to as "Dusse".

In this hardware implementation, security mechanisms and "on the fly" additions, subtractions, and moves have been added; processes whose total output might be irrelevant have been removed; a design that is relatively easy to implement on silicon has been invented and has been integrated to be appended to the internal data/address bus as a slave to virtually any 8, 16 or 32 bit Central Processing Unit (CPU).

Because of the simple synchronized shift design, the multiplying/squaring machine can run at clock speeds several times faster than speeds presently attainable with CPUs which support on-board non-volatile memory devices. This method demands no design changes in the memory architecture of the CPU, as prescribed by implementations using parallel multipliers and dual ported memories for fast modular multiplication of large numbers, as in the Philips circuit [Philips Components, "83C852 secured 8-bit microcontroller for conditional access applications", Eindhoven, August 1990], hereinafter referred to as "Philips".

The essential architecture is of a machine that can be integrated to any microcontroller design, mapped 
into memory; while working in parallel with the controller which must constantly load commands and operands, 
then unload and transmit the final answer. 

The unique solution uses only two serial/parallel multipliers, and a complete serial pipelined approach that saves silicon area. Using present popular technologies, it enables the integration of the complete solution, including a microcontroller with memories, onto a 4 by 4.5 by 0.2 mm microelectronic circuit that can meet the ISO 7816 standards [International Organization for Standardization, "Identification cards - integrated circuit cards", ISO 7816:

Part 1 - ISO 7816-1, "Physical characteristics", 1987.
Part 2 - ISO 7816-2, "Dimensions and location of the contacts", 1988.
Part 3 - ISO/IEC 7816-3, "Electronic signals & transmission protocols", 1989.], hereinafter referred to as "ISO 7816".

The invention is directed to the architecture of this solution, based on mathematical innovations published by Montgomery, with several modifications and improvements; non-obvious methods are provided for reducing the time necessary for modular exponentiation to little more than half the time required using known processing and the Montgomery method.

Definitions, General Principles and Methods 

The invention will be illustrated in the description to follow, making use of the general principles and methods described below.

For modular multiplication in the prime and composite prime field of numbers, we define A and B to be the multiplicand and the multiplier, and N to be the modulus, which is usually larger than A or B. N may in some instances be smaller than A. We define A, B, and N as $m \cdot k = n$ bit long operands. Each k bit group will be called a character. Then A, B, and N are each m characters long. For ease in following the first implementation and in the step by step procedural explanation, assume that A, B, and N are 512 bits long (n = 512); assume that k is 32 bits long because of the present cost effective length of the multipliers; and m = 16 is the number of characters in an operand and also the number of iterations in a squaring or multiplying loop with a 512 bit operand. Obviously, all operands are integers.

We use the symbol ≡ to denote congruence of modular numbers, for example $16 \equiv 2 \bmod 7$, and we say 16 is congruent to 2 modulo 7, as 2 is the remainder when 16 is divided by 7. When we write $Y \bmod N \equiv X \bmod N$, both Y and X may be larger than N; however, for positive X and Y, the remainders will be identical. Note also that the congruence of a negative integer Y is $Y + u \cdot N$, where N is the modulus, and if the congruence of Y is to be less than N, u will be the smallest integer which will give a positive result.

We use the symbol ¥ to denote congruence in a more limited sense. During the processes described herein, a value is often either the desired value, or equal to the desired value plus the modulus. For example, $X \mathrel{¥} 2 \bmod 7$: X can be equal to 2 or 9. We say X has limited congruence to 2 mod 7.

When we write $X = A \bmod N$, we define X as the remainder of A divided by N; e.g., $3 = 45 \bmod 7$.

In number theory the modular multiplicative inverse is a basic concept. For example, the modular multiplicative inverse of X is written as $X^{-1}$, which is defined by $X \cdot X^{-1} \bmod N = 1$. If X = 3 and N = 13, then $X^{-1} = 9$, i.e., the remainder of $3 \cdot 9$ divided by 13 is 1.
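The three notions defined above (congruence, limited congruence, and the modular multiplicative inverse) can be checked numerically. The following is a minimal sketch in Python; the helper name `limited_congruent` is an assumption made for illustration and does not appear in the specification.

```python
# Ordinary congruence: 16 and 2 leave the same remainder modulo 7.
same_remainder = (16 % 7 == 2 % 7)

def limited_congruent(x, r, n):
    """Hypothetical helper: True when x equals r or r + n,
    i.e. the desired value or the desired value plus the modulus."""
    return x in (r, r + n)

# All values below 20 that have limited congruence to 2 mod 7.
limited_values = [x for x in range(20) if limited_congruent(x, 2, 7)]

# Modular multiplicative inverse of 3 modulo 13 (Python 3.8+ syntax).
inverse = pow(3, -1, 13)
```

The built-in three-argument `pow` with a negative exponent is used here only as a reference; the device described in the text does not compute inverses this way.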
The acronyms MS and LS are used to signify most significant and least significant when referencing bits, characters, and full operand values.

Throughout this specification N designates both the value N, and the name of the shift register which contains N. A and N are constant values throughout an entire exponentiation. A is the value of the number which is to be exponentiated. During the first iteration of an exponentiation, B is equal to A. B is also the name of the register wherein the accumulated value, which finally equals the desired result of exponentiation, resides.

S designates a temporary value, and also the register in which the ¥ of S is stored. $S(i-1)$ denotes the value of S at the outset of the i'th iteration; $S_0$ denotes the LS character of an S(i)'th value.

We refer to the process (defined later) $\mathcal{P}(A \cdot B)_N$ as multiplication in the $\mathcal{P}$ field, or sometimes, simply, a multiplication operation.

Other symbols are those conventionally used in arithmetic.

Montgomery Modular Multiplication 

In a classic approach for calculating a modular multiplication, $A \cdot B \bmod N$, the remainder of the product $A \cdot B$ is calculated by a division process. A division operation is more difficult to implement than a multiplication operation.

By using Montgomery's modular reduction method, the division is essentially replaced by multiplications using precalculated constants.

The Montgomery function $\mathcal{P}(A \cdot B)_N$ performs a multiplication modulo N of the $A \cdot B$ product into the $\mathcal{P}$ field. The retrieval from the $\mathcal{P}$ field back into the normal modular field is performed by enacting $\mathcal{P}$ on the result of $\mathcal{P}(A \cdot B)_N$ and a precalculated constant H. Now, if $P \equiv \mathcal{P}(A \cdot B)_N$, then $\mathcal{P}(P \cdot H)_N \equiv A \cdot B \bmod N$; thereby performing a normal modular multiplication in two $\mathcal{P}$ field multiplications.

The intention of efficient modular reduction methods is to avert a series of multiplication and division operations on operands that are n and 2n bits long, by performing a series of multiplications, additions, and subtractions on operands that are n bits long, and that yield a final result that is a maximum of n bits long. In order to illustrate the Montgomery precept, we observe that for given A, B and odd N (these odd moduli are always either simple or a composite of large primes), there is always a Q, such that $A \cdot B + Q \cdot N$ will result in a number whose n LS bits are zero, or

$$P \cdot 2^n = A \cdot B + Q \cdot N$$

This means that we have an expression 2n bits long, whose n LS bits are zero.

Now, let $I \cdot 2^n \equiv 1 \bmod N$ (I exists for all odd N). Multiplying both sides of the previous equation by I yields the following congruences:

from the left side of the equation:

$$P \cdot I \cdot 2^n \equiv P \bmod N$$ (Remember that $I \cdot 2^n \equiv 1 \bmod N$)

and from the right side:

$$A \cdot B \cdot I + Q \cdot N \cdot I \equiv A \cdot B \cdot I \bmod N$$ (Remember that $Q \cdot N \cdot I \equiv 0 \bmod N$)

so therefore:

$$P \equiv A \cdot B \cdot I \bmod N.$$

Unfortunately, this also means that a parasitic factor I is introduced each time a $\mathcal{P}$ field multiplication is performed.

We define the $\mathcal{P}$ operator such that

$$P \equiv A \cdot B \cdot I \bmod N \equiv \mathcal{P}(A \cdot B)_N$$

and we call this "multiplication of A times B in the $\mathcal{P}$ field".

The retrieval from the $\mathcal{P}$ field is calculated by operating $\mathcal{P}$ on $P \cdot H$, making:

$$\mathcal{P}(P \cdot H)_N \equiv A \cdot B \bmod N$$

We can derive the value of H by substituting P in the previous congruence. We find:

$$\mathcal{P}(P \cdot H)_N \equiv (A \cdot B \cdot I)(H)(I) \bmod N$$

(see that $A \cdot B \cdot I \leftarrow P$; $H \leftarrow H$; $I \leftarrow$ the parasitic I which any multiplication operation introduces)

If H is congruent to the multiplicative inverse of $I^2$ then the congruence is valid, therefore:

$$H = I^{-2} \bmod N \equiv 2^{2n} \bmod N$$

(H is a function of N and we call it the H parameter)

To enact the $\mathcal{P}$ operator on $A \cdot B$ we pursue the following process, using the precalculated constant J:

1) $X = A \cdot B$
2) $Y = (X \cdot J) \bmod 2^n$ (only the n LS bits are necessary)
3) $Z = X + Y \cdot N$
4) $S = Z / 2^n$ (The requirement on J is that it forces Z to be divisible by $2^n$)
5) $P \mathrel{¥} S \bmod N$ (N is to be subtracted from S, if $S \ge N$)

Finally, at step 5):

$$P \mathrel{¥} \mathcal{P}(A \cdot B)_N$$

[After the subtraction of N, if necessary: $P = \mathcal{P}(A \cdot B)_N$.]

Following the above:

$$Y = A \cdot B \cdot J \bmod 2^n$$ (using only the n LS bits);

and:

$$Z = A \cdot B + (A \cdot B \cdot J \bmod 2^n) \cdot N.$$

In order that Z be divisible by $2^n$ (the n LS bits of Z must be zero) the following congruence must exist:

$$[A \cdot B + (A \cdot B \cdot J \bmod 2^n) \cdot N] \bmod 2^n \equiv 0$$

In order that this congruence will exist, $N \cdot J \bmod 2^n$ must be congruent to $-1$, or:

$$J \equiv -N^{-1} \bmod 2^n$$

and we have found the constant J.

J, therefore, is a precalculated constant which is a function of N only, and, obviously, we must always choose that positive J which is smaller than N.

Therefore, as will be apparent to the skilled person, the process shown employs three multiplications, one summation, and a maximum of one subtraction; for the given A, B, N, and a precalculated constant, we obtain $\mathcal{P}(A \cdot B)_N$. Using this result, the same process and a precalculated constant H (a function of the modulus N), we are able to find $A \cdot B \bmod N$. As A can be equal to B, this operator can be used as a device to square or multiply in modular arithmetic.
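The five steps above, together with the constants J and H, can be sketched in a few lines of Python. This is a software model under stated assumptions, not the serial circuit of the invention: the function names `mont_constants` and `mont_mul` are chosen here for illustration, Python's built-in `pow` stands in for the precalculation of J and H, and J is taken as the least positive residue modulo 2^n. The operands are those of the hexadecimal example used later in the text.

```python
def mont_constants(N, n):
    """Precalculate J = -N^-1 mod 2^n and H = 2^(2n) mod N for an odd N < 2^n."""
    R = 1 << n
    J = (-pow(N, -1, R)) % R
    H = pow(2, 2 * n, N)
    return J, H

def mont_mul(A, B, N, n, J):
    """Steps 1) to 5): return A*B*I mod N, where I*2^n = 1 mod N.
    Assumes A < N and B < N, so that one subtraction suffices."""
    R = 1 << n
    X = A * B                        # step 1
    Y = (X * J) % R                  # step 2: only the n LS bits
    Z = X + Y * N                    # step 3
    assert Z % R == 0                # J forces the n LS bits of Z to zero
    S = Z >> n                       # step 4: exact division by 2^n
    return S - N if S >= N else S    # step 5: at most one subtraction

N, n = 0xa59, 12
J, H = mont_constants(N, n)
A, B = 0x99b, 0x5c3
P = mont_mul(A, B, N, n, J)   # product in the P field (carries the parasitic I)
F = mont_mul(P, H, N, n, J)   # retrieval: F = A*B mod N
```

Two field multiplications yield the ordinary modular product, which is the property the text relies on.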
Interleaved Montgomery Modular Multiplication

In the previous section there was shown a method for modular multiplication which demanded multiplications of operands which were all n bits long, and results which required 2n + 1 bits of storage space.

Using Montgomery's interleaved reduction (as described in the aforementioned paper by Dusse), it is possible to perform the multiplication operations with shorter operands, registers, and hardware multipliers, enabling the implementation of an electronic device with relatively few logic gates.

Using a k bit multiplier, it is convenient to define characters of k bit length; there are m characters in n; i.e., $m \cdot k = n$.

$J_0$ will be the LS character of J. Therefore:

$$J_0 \equiv -N_0^{-1} \bmod 2^k$$ ($J_0$ exists as N is odd).

Then, using Montgomery's interleaved reduction, $\mathcal{P}(A \cdot B)_N$ is enacted in m iterations with the following initial condition, pursuing steps (1) to (5). The circuit of the invention follows these steps in a concurrent fashion.

Initially S(0) = 0 (the ¥ value of S at the outset of the first iteration).

For i = 1, 2, ..., m:

(1) $X = S(i-1) + A_{i-1} \cdot B$ ($A_{i-1}$ is the i-1 th character of A; S(i-1) is the value of S at the outset of the i'th iteration.)
(2) $Y_0 = X_0 \cdot J_0 \bmod 2^k$ (The LS k bits of the product $X_0 \cdot J_0$. The process uses and calculates the k LS bits only, e.g., the least significant 32 bits.)
(3) $Z = X + Y_0 \cdot N$
(4) $S(i) = Z / 2^k$ (The k LS bits of Z are always 0, therefore Z is always divisible by $2^k$. This division is tantamount to a k bit right shift as the LS k bits of Z are all zeros; or, as will be seen in the circuit, the LS k bits of Z are simply disregarded.)
(5) $S(i) = S(i) \bmod N$ (N is to be subtracted from those S(i)'s which are larger than or equal to N).

Finally, at the last iteration (after the subtraction of N, when necessary), $C = S(m) = \mathcal{P}(A \cdot B)_N$. To derive $F = A \cdot B \bmod N$, we must perform the $\mathcal{P}$ field calculation, $\mathcal{P}(C \cdot H)_N$.

Now, we prove that for all S(i)'s, S(i) is smaller than 2N (not included in Montgomery's proof).

We observe that for operands which are used in the process:

$$S(i-1) < N; \quad B < N \quad \text{and} \quad A_{i-1} < 2^k.$$

(The first two inequalities hold, as at the outset of an iteration N is subtracted from S(i-1) and B, when they were either larger than or equal to N. The third inequality holds as $2^k$ is a k + 1 bit long number whose MS bit is "1", while $A_{i-1}$ is a k bit long operand.)

By definition:

$$S(i) = Z / 2^k$$ (The value of S at the end of the process, before a possible subtraction)

Substituting in the above set of equations:

$$Z = S(i-1) + A_{i-1} \cdot B + (X_0 \cdot J_0 \bmod 2^k) \cdot N$$

Note that taking the maximum value of each element in the previous equation we have the inequality on Z:

$$Z \le (N - 1) + (2^k - 1)(N - 1) + (2^k - 1) \cdot N = 2^k \cdot N + 2^k \cdot N - N - 2^k$$

and then certainly:

$$Z < 2^k \cdot N + 2^k \cdot N.$$

Now, dividing both sides of the inequality by $2^k$:

$$Z / 2^k < N + N,$$

and we have proved that one subtraction of N is all that may ever be necessary to rectify an S(i) or a B.
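The interleaved steps (1) to (5) and the bound S(i) < 2N can be exercised with a short Python model. It is a sketch only: the function name `interleaved_mont_mul` is an assumption for illustration, the character constant J_0 is obtained with Python's built-in `pow` rather than by any method from the specification, and the concurrency of the circuit is not modelled.

```python
def interleaved_mont_mul(A, B, N, k, m):
    """Return A*B*2^(-k*m) mod N by m iterations of steps (1) to (5).
    Assumes N odd, B < N and A < 2^(k*m)."""
    mask = (1 << k) - 1
    J0 = (-pow(N & mask, -1, 1 << k)) & mask   # J0 = -N0^-1 mod 2^k
    S = 0
    for i in range(m):
        Ai = (A >> (k * i)) & mask      # i-th character of A, LS character first
        X = S + Ai * B                  # step (1)
        Y0 = ((X & mask) * J0) & mask   # step (2): k LS bits only
        Z = X + Y0 * N                  # step (3)
        assert Z & mask == 0            # the k LS bits of Z are zero
        S = Z >> k                      # step (4): k bit right shift
        assert S < 2 * N                # the bound proved in the text
        if S >= N:                      # step (5): one subtraction at most
            S -= N
    return S

# Exhaustive check against a direct computation for a small odd modulus.
N, k, m = 0xa59, 4, 3
I = pow(2, -k * m, N)                   # the parasitic factor I
mismatches = [
    (A, B)
    for A in range(0, 1 << (k * m), 97)
    for B in range(0, N, 101)
    if interleaved_mont_mul(A, B, N, k, m) != (A * B * I) % N
]
```

The internal assertions fail loudly if the bound were ever violated for the sampled operands; `mismatches` stays empty when every sampled product agrees with the direct computation.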
Example 1

An interleaved modular multiplication:

The following calculations can easily be verified with a hand calculator which has a hexadecimal mode. Using the hexadecimal format, assume:

N = a59 (the modulus), A = 99b (the multiplier), B = 5c3 (the multiplicand), n = 12 (the bit length of N), k = 4 (the size in bits of the multiplier and also the size of a character), and m = 3, as $n = k \cdot m$.

$J_0 = 7$ as $7 \cdot 9 \equiv -1 \bmod 16$ and $H = 2^{2 \cdot 12} \bmod \mathrm{a59} = \mathrm{44b}$.

The expected result is $F = A \cdot B \bmod N = \mathrm{99b} \cdot \mathrm{5c3} \bmod \mathrm{a59} \equiv 375811 \bmod \mathrm{a59} = 220_{16}$.

Initially: S(0) = 0

Step 1
$X = S(0) + A_0 \cdot B = 0 + \mathrm{b} \cdot \mathrm{5c3} = \mathrm{3f61}$
$Y_0 = X_0 \cdot J_0 \bmod 2^k = 7$
$Z = X + Y_0 \cdot N = \mathrm{3f61} + 7 \cdot \mathrm{a59} = \mathrm{87d0}$
$S(1) = Z / 2^k = \mathrm{87d}$ (which is smaller than N)

Step 2
$X = S(1) + A_1 \cdot B = \mathrm{87d} + 9 \cdot \mathrm{5c3} = \mathrm{3c58}$
$Y_0 = X_0 \cdot J_0 \bmod 2^k = 8 \cdot 7 \bmod 2^4 = 8$
$Z = X + Y_0 \cdot N = \mathrm{3c58} + \mathrm{52c8} = \mathrm{8f20}$
$S(2) = Z / 2^k = \mathrm{8f2}$ (which is smaller than N)

Step 3
$X = S(2) + A_2 \cdot B = \mathrm{8f2} + 9 \cdot \mathrm{5c3} = \mathrm{3ccd}$
$Y_0 = \mathrm{d} \cdot 7 \bmod 2^4 = \mathrm{b}$
$Z = X + Y_0 \cdot N = \mathrm{3ccd} + \mathrm{b} \cdot \mathrm{a59} = \mathrm{aea0}$
$S(3) = Z / 2^k = \mathrm{aea}$; as S(3) > N, $S(3) = \mathrm{aea} - \mathrm{a59} = 91$

Therefore $C = \mathcal{P}(A \cdot B)_N = 91_{16}$.

Retrieval from the $\mathcal{P}$ field is performed by calculating $\mathcal{P}(C \cdot H)_N$:
Again initially: S(0) = 0

Step 1
$X = S(0) + C_0 \cdot H = 0 + 1 \cdot \mathrm{44b} = \mathrm{44b}$
$Y_0 = \mathrm{d}$
$Z = X + Y_0 \cdot N = \mathrm{44b} + \mathrm{8685} = \mathrm{8ad0}$
$S(1) = Z / 2^k = \mathrm{8ad}$

Step 2
$X = S(1) + C_1 \cdot H = \mathrm{8ad} + 9 \cdot \mathrm{44b} = \mathrm{2f50}$
$Y_0 = 0$
$Z = X + Y_0 \cdot N = \mathrm{2f50} + 0 = \mathrm{2f50}$
$S(2) = Z / 2^k = \mathrm{2f5}$

Step 3
$X = S(2) + C_2 \cdot H = \mathrm{2f5} + 0 \cdot \mathrm{44b} = \mathrm{2f5}$
$Y_0 = 3$
$Z = X + Y_0 \cdot N = \mathrm{2f5} + 3 \cdot \mathrm{a59} = 2200$
$S(3) = Z / 2^k = 220_{16}$

which is the expected value of $\mathrm{99b} \cdot \mathrm{5c3} \bmod \mathrm{a59}$.

The validity of the operation can be understood intuitively, when we realize that if at each step we disregard k LS zeros, we are in essence multiplying the n MS bits by $2^k$. Likewise, at each step, the i'th segment of the multiplier is also a number multiplied by $2^{ik}$, giving it the same rank as S(i).
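The intermediate values of Example 1 can be regenerated mechanically. The sketch below is a Python trace of both field multiplications; the function name `trace` and the tuple layout (X, Y_0, Z, S after any subtraction) are choices made here for illustration, while the operands and J_0 = 7 are those given in the example.

```python
def trace(A, B, N, k, m, J0):
    """Return one (X, Y0, Z, S) tuple per iteration of the interleaved loop."""
    mask = (1 << k) - 1
    S, rows = 0, []
    for i in range(m):
        X = S + ((A >> (k * i)) & mask) * B   # step (1)
        Y0 = ((X & mask) * J0) & mask         # step (2)
        Z = X + Y0 * N                        # step (3)
        S = Z >> k                            # step (4)
        if S >= N:                            # step (5)
            S -= N
        rows.append((X, Y0, Z, S))
    return rows

N, A, B, H = 0xa59, 0x99b, 0x5c3, 0x44b
first = trace(A, B, N, 4, 3, 7)      # C = P(A*B)_N
C = first[-1][3]
second = trace(C, H, N, 4, 3, 7)     # retrieval, P(C*H)_N
for X, Y0, Z, S in first + second:
    print(f"X={X:x}  Y0={Y0:x}  Z={Z:x}  S={S:x}")
```

Printing in hexadecimal makes the rows directly comparable with the hand calculation.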
Modular reduction on a Montgomery machine in one multiplication process

Many cryptographic processes, such as the NIST Digital Signature Standard or modular exponentiation using the Chinese Remainder Theorem, require reducing a number which is larger (often more than twice as large) than a second modulus.

These modular reductions can efficiently be executed in one interleaved Montgomery multiplication using the machine of the invention and a non-obvious extension to the Montgomery algorithm.

Note that in the previous examples, it was implied that n, the length of the operands, was also the exact length of N. For ordinary exponentiations and multiplications this would be most efficient. However, in those cases where a reduction in size is necessary, use can be made of a second constant, $M = 2^n \bmod N$, which, when Montgomery multiplied by the number to be reduced, in one operation effects a minimum reduction. This constant, M, can be calculated with the same mechanism which calculates the constant H (see sections on calculating the H parameter), by placing the modulus, N, in the most significant part of the divisor operand, so that its most significant "1" rests in the most significant bit of the divisor register. The number of shift/trial-subtracts, obviously, must now be n + 1 - L, wherein L is the number of relevant bits of N. Note that this M will be an operand L bits long.

To prove this premise, first we repeat that a Montgomery multiplication of $A \cdot B \bmod N$, ($\mathcal{P}(A \cdot B)_N$), yields the congruence $A \cdot B \cdot I \bmod N$. If we assign B = M, then, as $M \cdot I \equiv 2^n \cdot I \equiv 1 \bmod N$:

$$\mathcal{P}(A \cdot M)_N \equiv A \cdot M \cdot I \bmod N \equiv A \bmod N.$$



Example 2

An interleaved Montgomery reduction:

To demonstrate a reduction of t modulo q (t mod q), wherein the multiplying register where t is initially stored is 24 bits long, which is longer than q.

Assume a word length (size of machine multiplier) of 8 bits, and the following test variables:

n = 24; k = 8; t = 0af59b; q = 2b13; and
$R = M = 2^{24} \bmod q = \mathrm{141d}$.

Using a simple division calculation we know for comparison that t mod q = 5c8.

Note that the reduction and retrieval are performed in one Montgomery multiplication.

Initially: S(0) = 0, A = t = 0af59b, B = R = 141d, N = q = 2b13

Step 1
$X = S(0) + A_0 \cdot B = 0 + \mathrm{9b} \cdot \mathrm{141d} = \mathrm{c2d8f}$
$Y_0 = X_0 \cdot J_0 \bmod 2^k = \mathrm{8f} \cdot \mathrm{e5} \bmod 2^8 = \mathrm{eb}$
$Z = X + Y_0 \cdot N = \mathrm{c2d8f} + \mathrm{eb} \cdot \mathrm{2b13} = \mathrm{33b800}$
$S(1) \mathrel{¥} Z / 2^k \bmod N = \mathrm{33b8}$, which is larger than N
$S(1) = \mathrm{33b8} - \mathrm{2b13} = \mathrm{8a5}$

Step 2
$X = S(1) + A_1 \cdot B = \mathrm{8a5} + \mathrm{f5} \cdot \mathrm{141d} = 134866$
$Y_0 = X_0 \cdot J_0 \bmod 2^k = 66 \cdot \mathrm{e5} \bmod 2^8 = \mathrm{3e}$
$Z = X + Y_0 \cdot N = 134866 + \mathrm{3e} \cdot \mathrm{2b13} = \mathrm{1db700}$
$S(2) = Z / 2^k \bmod N = \mathrm{1db7}$

Step 3
$X = S(2) + A_2 \cdot B = \mathrm{1db7} + \mathrm{0a} \cdot \mathrm{141d} = \mathrm{e6d9}$
$Y_0 = \mathrm{d9} \cdot \mathrm{e5} \bmod 2^8 = \mathrm{1d}$
$Z = X + Y_0 \cdot N = \mathrm{e6d9} + \mathrm{1d} \cdot \mathrm{2b13} = \mathrm{5c800}$
$S(3) = Z / 2^k \bmod N = \mathrm{5c8}$

And t mod q = 5c8, as was previously calculated.
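The reduction of Example 2 can be modelled as a single interleaved multiplication by the constant M = 2^n mod q. The Python sketch below assumes an odd q and a register length n that is a multiple of k; the function name `mont_reduce` is hypothetical, and M and J_0 are obtained with the built-in `pow` rather than with the trial-subtraction mechanism of the specification.

```python
def mont_reduce(t, q, n, k):
    """Return t mod q using one interleaved Montgomery multiplication of t by
    M = 2^n mod q. Assumes q odd, t < 2^n and n a multiple of k."""
    mask = (1 << k) - 1
    J0 = (-pow(q & mask, -1, 1 << k)) & mask   # J0 = -q0^-1 mod 2^k
    M = pow(2, n, q)                           # the second constant of the text
    S = 0
    for i in range(n // k):
        X = S + ((t >> (k * i)) & mask) * M    # step (1), with B = M
        Y0 = ((X & mask) * J0) & mask          # step (2)
        S = (X + Y0 * q) >> k                  # steps (3) and (4)
        if S >= q:                             # step (5)
            S -= q
    return S

t, q = 0x0af59b, 0x2b13
reduced = mont_reduce(t, q, 24, 8)
```

Because M cancels the parasitic factor, no separate retrieval multiplication is needed.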

Exponentiation

The following derivation of a sequence [D. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, Addison-Wesley, Reading, Mass., 1981], hereinafter referred to as "Knuth", explains a sequence of squares and multiplies, which implements a modular exponentiation.

Assuming that we have precalculated the constants in the above section, and that our device can both square and multiply in the $\mathcal{P}$ field, we wish to calculate:

$$C = A^E \bmod N.$$

Let E(j) denote the j'th bit in the binary representation of the exponent E, starting with the MS bit whose index is 1 and concluding with the LS bit whose index is q; we can exponentiate as follows:

B = A
FOR j = 2 TO q
    a) $B \mathrel{¥} \mathcal{P}(B \cdot B)_N$
    b) $B \mathrel{¥} \mathcal{P}(B \cdot H)_N$ (steps a and b are equivalent to $B \mathrel{¥} B^2 \bmod N$)
    IF E(j) = 1 THEN
        a) $B \mathrel{¥} \mathcal{P}(B \cdot A)_N$
        b) $B \mathrel{¥} \mathcal{P}(B \cdot H)_N$ (steps a and b are equivalent to $B \mathrel{¥} B \cdot A \bmod N$)
ENDFOR

In the transition from each step to the next, N is subtracted from B whenever B is larger than or equal to N.

After the last iteration, the value B is ¥ to $A^E \bmod N$.
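The square-and-multiply sequence above, with a retrieval after every field multiplication, can be sketched as follows. The model is deliberately abstract: `mont` is a reference function that returns A·B·2^(-n) mod N directly (an assumption standing in for the device), and `mod_exp` is a name chosen here. The exponent is scanned from E(2) to E(q), so E must be at least 1.

```python
def mont(A, B, N, n):
    """Reference model of the field multiplication: A*B*I mod N, I = 2^-n mod N."""
    return (A * B * pow(2, -n, N)) % N

def mod_exp(A, E, N, n):
    """A^E mod N with steps a) and b) for every squaring and multiplication."""
    H = pow(2, 2 * n, N)
    B = A
    for bit in bin(E)[3:]:                        # E(2) .. E(q); E(1) is skipped
        B = mont(mont(B, B, N, n), H, N, n)       # a), b): B = B^2 mod N
        if bit == "1":
            B = mont(mont(B, A, N, n), H, N, n)   # a), b): B = B*A mod N
    return B

result = mod_exp(0x99b, 0x10001, 0xa59, 12)
```

Each squaring or multiplication costs two field multiplications here, which is the overhead the constant T of the following section is meant to remove.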

There are more efficient proprietary protocols that could be used with the described circuitry to perform modular exponentiation; we name two encryption protocols on which the method described herein will often double the speed of exponentiation. In the RSA method [R. L. Rivest et al., "A method for obtaining digital signatures and public key cryptosystems", Comm. of the ACM, vol. 21, 120-126, 1978], hereinafter referred to as "RSA", and the Diffie-Hellman protocol [W. Diffie and M. E. Hellman, "New directions in cryptography", IEEE Trans. on Inform. Theory, vol. IT-22, 644-654, 1976], hereinafter referred to as "Diffie-Hellman", most of the difficult exponentiations are executed using a constant exponent. The method of the following section (an efficient method for a retrieval from a $\mathcal{P}$ field exponentiation) reduces computation time for those computations where a constant exponent is used. When this method is used, steps b) in the described exponentiation process (all $\mathcal{P}(B \cdot H)_N$ multiplications) are deleted, and the final value of B, after the q'th iteration of the exponentiation, is multiplied in the Montgomery $\mathcal{P}$ field by a precalculated constant T.

To those involved in the implementation, it is obvious that for full RSA signatures, with this circuitry, using the Chinese Remainder Theorem [described in the aforementioned book by Knuth], it is possible to make a further reduction of more than 70% in the computation time.

An efficient method for a retrieval from a $\mathcal{P}$ field exponentiation

The square and multiply protocol of the previous section can be improved, and it is possible to reduce the number of $\mathcal{P}$ field multiplications during the iterative sequence by introducing a new precalculated constant, T, which is a function of the modulus, N, and the exponent E.

$$T = (2^n)^{\Xi} \bmod N = (I^{-1})^{\Xi} \bmod N$$

where

$$\Xi = 2^{q-1} + E \bmod 2^{q-1}$$

and q is the number of relevant bits in E (disregard any leading zeros).

The modular exponentiation can now be calculated with the sequence:

Initially:
B = A
FOR j = 2 TO q
    $B \mathrel{¥} \mathcal{P}(B \cdot B)_N$
    IF E(j) = 1 THEN
        $B \mathrel{¥} \mathcal{P}(B \cdot A)_N$
END FOR
$B \mathrel{¥} \mathcal{P}(B \cdot T)_N$

Assume again, that on each transition from one step to the next N is subtracted from B, whenever B is larger than or equal to N.

Note again that every multiplication in the $\mathcal{P}$ field is equivalent to a modular multiplication of the same factors by I, e.g., $\mathcal{P}(X \cdot Y)_N \equiv X \cdot Y \cdot I \bmod N$.
Example 3

This example demonstrates the use of T in the calculation of $A^E \bmod N$ and makes T's definition obvious.

Assume n = 4 and $E = 5 = 0101_2$. q (after discarding E's leading zero) is 3, therefore:

E(1) = 1; E(2) = 0; and E(3) = 1,

and T is precalculated:

$$T = (2^n)^{\Xi} \bmod N = (I^{-1})^{\Xi} \bmod N$$

$$\Xi = 2^{q-1} + E \bmod 2^{q-1} = 2^{3-1} + 5 \bmod 2^{3-1} = 4 + 1 = 5$$

and therefore:

$$T = I^{-5} \bmod N,$$

as is seen when:

Initially:
B = A
j = 2, E(2) = 0:
    $B = \mathcal{P}(B \cdot B)_N \equiv A^2 \cdot I \bmod N$
j = 3, E(3) = 1:
    $B = \mathcal{P}(B \cdot B)_N \equiv B^2 \cdot I \equiv A^4 \cdot I^2 \cdot I \bmod N$
    $B = \mathcal{P}(B \cdot A)_N \equiv A^4 \cdot I^3 \cdot A \cdot I \bmod N$
and finally:
    $B \mathrel{¥} \mathcal{P}(B \cdot T)_N \equiv A^5 \cdot I^4 \cdot I^{-5} \cdot I \bmod N \equiv A^5 \bmod N$
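The saving obtained with T can be made concrete by counting field multiplications. The sketch below reuses an abstract reference model of the field multiplication (`mont`, an assumption standing in for the device); `mod_exp_T`, the variable `xi` for the exponent in the definition of T, and the returned counter are names chosen here for illustration.

```python
def mont(A, B, N, n):
    """Reference model of the field multiplication: A*B*I mod N, I = 2^-n mod N."""
    return (A * B * pow(2, -n, N)) % N

def mod_exp_T(A, E, N, n):
    """Return (A^E mod N, number of field multiplications) using the constant T."""
    q = E.bit_length()                          # relevant bits of E
    xi = (1 << (q - 1)) + E % (1 << (q - 1))    # exponent in the definition of T
    T = pow(2, n * xi, N)                       # T = (2^n)^xi mod N, precalculated
    B, count = A, 0
    for bit in bin(E)[3:]:                      # E(2) .. E(q)
        B = mont(B, B, N, n); count += 1        # squaring, no retrieval step
        if bit == "1":
            B = mont(B, A, N, n); count += 1    # multiplication, no retrieval step
    B = mont(B, T, N, n); count += 1            # single retrieval at the end
    return B, count

value, multiplications = mod_exp_T(0x99b, 5, 0xa59, 12)
```

For E = 5 the loop performs two squarings and one multiplication, so four field multiplications in total replace the six needed when every step is followed by its own retrieval.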
The introduction of the parameter T can be avoided if the following steps are followed in order to calculate $A^E$:

Assuming that we have precalculated the Montgomery constant, H, and that our device can both square and multiply in the $\mathcal{P}$ field, we wish to calculate:

$$C = A^E \bmod N.$$

Let E(j) denote the j'th bit in the binary representation of the exponent E, starting with the MS bit whose index is 1 and concluding with the LS bit whose index is q; we can exponentiate as follows for odd exponents:

$A^* \mathrel{¥} \mathcal{P}(A \cdot H)_N$
$B = A^*$
FOR j = 2 TO q - 1
    $B \mathrel{¥} \mathcal{P}(B \cdot B)_N$
    IF E(j) = 1 THEN
        $B \mathrel{¥} \mathcal{P}(B \cdot A^*)_N$
ENDFOR
$B \mathrel{¥} \mathcal{P}(B \cdot B)_N$ (the squaring for j = q, which the worked example below performs before the final multiplication)
$B \mathrel{¥} \mathcal{P}(B \cdot A)_N$
C = B

In the transition from each step to the next, N is subtracted from B whenever B is larger than or equal to N.

After the last iteration, the value B is ¥ to $A^E \bmod N$, and C is the final value.
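For even exponents, the last step could be:

$B \mathrel{¥} \mathcal{P}(B \cdot 1)_N$ instead of $B \mathrel{¥} \mathcal{P}(B \cdot A)_N$

To clarify, we shall use the following example:

E = 1011 → E(1) = 1; E(2) = 0; E(3) = 1; E(4) = 1;
To find $A^{1011_2} \bmod N$; q = 4

$A^* = \mathcal{P}(A \cdot H)_N \equiv A \cdot I^{-2} \cdot I = A \cdot I^{-1} \bmod N$
$B = A^*$
for j = 2 to q:
j = 2: $B = \mathcal{P}(B \cdot B)_N$, which produces $A^2 (I^{-1})^2 \cdot I = A^2 \cdot I^{-1}$
       E(2) = 0; $B = A^2 \cdot I^{-1}$
j = 3: $B = \mathcal{P}(B \cdot B)_N = (A^2)^2 (I^{-1})^2 \cdot I = A^4 \cdot I^{-1}$
       E(3) = 1: $B = \mathcal{P}(B \cdot A^*)_N = (A^4 \cdot I^{-1})(A \cdot I^{-1}) \cdot I = A^5 \cdot I^{-1}$
j = 4: $B = \mathcal{P}(B \cdot B)_N = A^{10} \cdot I^{-2} \cdot I = A^{10} \cdot I^{-1}$

As E(4) was odd, the last multiplication will be by A, to remove the parasitic $I^{-1}$.

$B = \mathcal{P}(B \cdot A)_N = A^{10} \cdot I^{-1} \cdot A \cdot I = A^{11}$
C = B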
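The variant without T can be sketched in the same abstract style. The code follows the worked example: every bit from E(2) to E(q) is squared, the multiplication by A* is applied only for j < q, and the last field multiplication is by A for odd E or by 1 for even E. `mont` is again a reference model assumed for illustration, `mod_exp_no_T` is a hypothetical name, and the sketch assumes E has at least two relevant bits.

```python
def mont(A, B, N, n):
    """Reference model of the field multiplication: A*B*I mod N, I = 2^-n mod N."""
    return (A * B * pow(2, -n, N)) % N

def mod_exp_no_T(A, E, N, n):
    """A^E mod N using only H; assumes E >= 2 (q >= 2)."""
    assert E >= 2
    q = E.bit_length()
    H = pow(2, 2 * n, N)
    A_star = mont(A, H, N, n)            # A* = A * I^-1 mod N
    B = A_star
    for j, bit in enumerate(bin(E)[3:], start=2):   # E(2) .. E(q)
        B = mont(B, B, N, n)             # squaring keeps one parasitic I^-1
        if bit == "1" and j < q:
            B = mont(B, A_star, N, n)    # multiply by A*, still one I^-1
    last = A if E & 1 else 1             # odd E: multiply by A; even E: by 1
    return mont(B, last, N, n)           # removes the parasitic I^-1

C = mod_exp_no_T(0x99b, 0b1011, 0xa59, 12)
```

The invariant is that B always carries exactly one factor I^-1 until the final multiplication cancels it.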

Calculating the H parameter 



The H parameter is a constant that is vital for computations in the Montgomery field. Using certain protocols, H will be a constant that might be precalculated on a larger computer, or in other cases it might be a first stage parameter used in calculating a more useful constant. See the previous section.

In regular communications it might be assumed that H will be precalculated; however, for several protocols, e.g., authenticating a signature in a random communication in RSA, it might be necessary to calculate H with this device, e.g., the Smart Card.

The H parameter is defined as:

$$H = 2^{2n} \bmod N.$$

This means that H is the remainder of a normal division operation wherein a string with an MS bit of one followed by 2n LS zeros (a 2n + 1 bit long operand) is divided by the modular base N.

Binary division by the divisor, N, of a dividend consisting of a "1" and a string of zeros, is tantamount to sequentially trial-subtracting N, i.e. subtracting N from the residual trial-dividend when the most significant n + 1 bits are larger than N. (Follow the example.)

Although the dividend is 2n + 1 bits long, it will be obvious that the residual trial-dividend which is affected by a subtraction is never more than n + 1 bits long, and the LS digits are zeros.

For example:

Find H when $N = 11_{10} = 1011_2$ (therefore the bit length of N is 4, i.e., n = 4).

Dividing, as we would manually perform long division base 2:

    1011 | 100000000
            1011          SUCCESSFUL SUBTRACT
            01010         <= result of the 1st round
             1011         NO SUBTRACT
             10100        <= result of the 2nd round
              1011        SUCCESSFUL SUBTRACT
              10010       <= result of the 3rd round
               1011       SUCCESSFUL SUBTRACT
               01110      <= result of the 4th round
                1011      SUCCESSFUL SUBTRACT
                0011      <= RESULT of the 5th (n+1) round = H (3 base 10 = the remainder)

where we have verified that $H = 3_{10}$.

There are n + 1 trial subtractions in an H division process. Note also that the trial-dividend is also n + 1 bits long. This sequence of subtractions will be followed in hardware in the description to follow.
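The trial-subtraction sequence can be written as a short loop on an (n + 1)-bit residual. This is a sketch under the assumption that N is exactly n bits long; the function name `h_parameter` and the returned list of subtract decisions are illustrative choices, and the comparison stands in for the borrow detector of the hardware.

```python
def h_parameter(N, n):
    """Return (H, decisions) where H = 2^(2n) mod N is found by n + 1 trial
    subtractions and decisions[i] is True when round i+1 subtracted N."""
    assert N.bit_length() == n
    residual = 1 << n                 # MS n+1 bits of the dividend: "1" then n zeros
    decisions = []
    for round_no in range(n + 1):
        subtract = residual >= N      # no borrow: the subtraction succeeds
        if subtract:
            residual -= N
        decisions.append(subtract)
        assert residual < N           # so the shifted residual fits in n + 1 bits
        if round_no < n:
            residual <<= 1            # bring down the next zero of the dividend
    return residual, decisions

H, decisions = h_parameter(0b1011, 4)
```

For N = 1011 the decisions reproduce the long division above: subtract, no subtract, then three successful subtractions.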

This invention relates to a compact synchronous microelectronic peripheral machine for standard microprocessors with means for proper clocking and control, having as essential elements: three main subdivided, switched and clocked shift registers, B, S and N; only two multiplexed serial/parallel multipliers; borrow detectors, ancillary subtracters and adders; delay registers and switching elements; all of which embody a totally integrated concurrent and synchronous process approach to modular multiplication, squaring, and exponentiation. A further embodiment implements a unique, not straightforward, synchronized hardware derivation of the Montgomery method designed for hardware modular multiplication, squaring, and exponentiation. In the alternative it is also possible to execute a derivation of the Montgomery method as a multiplicity of simultaneous serial processes, i.e., multiplications, subtractions, additions, stored delays, and a division by $2^k$. The processes are executed in parallel, as serial processes merge.

In yet a further embodiment it is possible to execute a derivation of the Montgomery method as a multiplicity of serial processes for modular multiplication, squaring, and exponentiation, precluding the use of wide internal busses. It is further possible to make this derivation sufficiently compact to be fabricated on a microchip as specified by the ISO 7816 standards for portable Smart Cards using popular 1 micron technology.

It is further anticipated that it is possible to execute a derivation of the Montgomery method as above described, as a multiplicity of serial processes for modular multiplication, squaring, and exponentiation, which can be controlled by any microprocessor with an internal bus, without changing its basic architecture, and specifically, without redesigning memories for dual port access, and with relatively small demands of firmware.

Such a machine according to this invention may also use the microcontroller to regulate the cascade of $\mathcal{P}$ field sequences of squarings and multiplications, wherein the exponent E need not be stored in the MULT block, saving one n bit long shift register and simplifying the MULT control, while demanding little additional microcontroller ROM code.

According to a further embodiment of the machine according to this invention, loading the $A_i$ register with the squaring multiplicands "on the fly", while the B register is rotating, precludes unloading by the microcontroller of the previous final values of B and/or B-N in order to reload the $A_i$ register with $B_i$ characters. This conserves microcontroller RAM, and eliminates at least n effective clock cycles on each squaring iteration.

In a further variation two storage registers and separate serial subtraction operations are deleted from a straightforward implementation of the Montgomery method. This is accomplished by enacting a single serial detection on $Z / 2^k$ minus N, to determine if $Z / 2^k$ is larger than or equal to N, and subsequently achieving smaller than N operands with only one serial subtraction.

In yet a further embodiment the circuit is synchronized in such a quasi-parallel manner that only two multipliers are used to perform three simultaneous multiplication operations. In a silicon implementation serial/parallel multipliers can occupy 40% of the silicon area. Using two instead of three serial/parallel multipliers conserves enough silicon to double the number of cells in the remaining two multipliers. This doubling of the multiplier size reduces the process time for a 512 bit multiplication by more than 45%.

It is also understood that a machine according to the above can be used with a digital delay circuit, Delay3
(a k bit shift register), to synchronize the serial addition of X with the serial result of the multiplier ML1,
Y0·N, precluding double storage of products or a repetition of a serial/parallel multiplication.

In a further embodiment there are two digital delay circuits, Delay1 and Delay2 (two k bit shift registers),
used to synchronize three serial multiplications in which N is a factor, i.e., B·Ai, X·J0, and Y0·N.

Alternatively a machine as described above could be constructed, in which a digital delay circuit, Delay3,
synchronizes the operation of the serial/parallel multiplier, ML2, so that it can perform two separate
multiplication operations in the process stream, i.e., X·J0 and Y0·N.

It is also possible that, in such a machine, the registers S, B, and N are configured to be either n bits or
n/2 bits long, whereby exponentiation over n/2 bit length moduli can be accomplished in little more than one
eighth of the effective clock cycles that would be necessary for n bit length exponentiations.

In yet a further embodiment a machine according to the above is anticipated, which when processed with
an original retrieval factor T, can reduce the number of field multiplication operations on a full RSA
signature exponentiation to close to half.

This machine could also, assuming necessary precalculations, execute the complete multiplication process
P(A·B)N of an n bit number in only m(n+2k) effective clock cycles, because of "on the fly" loading of the
A register, "on the fly prediction" of the size of the contents of the S register, and "on the fly"
synchronization of the partial operands.
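The cycle count m(n+2k) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the factor of 1.5 multiplications per exponent bit is an assumption (an average for square-and-multiply), not a figure taken from this specification.

```python
def mult_cycles(n: int, k: int) -> int:
    """Effective clock cycles for one P(A·B)N multiplication: m(n + 2k), m = n / k."""
    if n % k != 0:
        raise ValueError("n must be a multiple of k")
    m = n // k
    return m * (n + 2 * k)


def exponentiation_cycles(n: int, k: int, ops_per_exponent_bit: float = 1.5) -> float:
    """Rough cycle count for an n bit exponent (the 1.5 factor is an assumption)."""
    return ops_per_exponent_bit * n * mult_cycles(n, k)


full = mult_cycles(512, 32)   # 16 * (512 + 64)
half = mult_cycles(256, 32)   # 8 * (256 + 64)
ratio = exponentiation_cycles(256, 32) / exponentiation_cycles(512, 32)
print(full, half, round(ratio, 3))   # 9216 2560 0.139
```

Under these assumptions the n/2 bit exponentiation needs about 0.139 of the cycles of the n bit one, which agrees with the statement that it takes little more than one eighth.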

This invention further includes a machine according to the above, using the same registers in the same
machine as used for Montgomery multiplications, to which a small borrow detect circuit is appended and a
simple addition is made to the controlling mechanism, and which, operating in a second mode, calculates
the H parameter.

It is further anticipated that every subprocess and process is executed with a predetermined number of
clock cycles, so that a P field multiplication and/or a squaring is performed in known sequences of clock
cycles, enabling the embodiment of a simplified control consisting of a cascade of self-exciting counting
mechanisms with no internal conditional branches.

According to the invention it is anticipated that any of the machines described (infra or supra) could be
provided with an even more improved method for performing modular exponentiation of D = A^E mod N, which
comprises the steps of:

1. storing the exponent E in a computer register;

2. loading the modulus into the aforesaid register N;

3. setting the aforesaid register S to zero;

4. performing a multiplication operation, by the method of application No. 104753, of A* = P(A·H)N,
wherein A is the operand to be exponentiated, and H is a precalculated parameter as defined before;

5. loading A* into the base register B;

6. performing a squaring operation of the contents of register B;

7. shifting said exponent E left;

8. ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 9 to 10;

9. for every one of said E bits, regardless of their being 0 or 1, carrying out operations 4 and 5 of the
squaring method hereinbefore set forth, wherein both the multiplicand and the multiplier originate from
the B register, and wherein the successive characters of the Montgomery multiplier are loaded into
register Ai from register B;

10. if and only if the current bit of the exponent E is 1, carrying out, after performing operation 9,
operations 4 and 5 of the multiplication method hereinbefore set forth, wherein the multiplicand is the
content of register B and the multiplier is the base A*; and

11. after performing steps 8-10 for all bits of E, performing an additional multiplication of register B by
the original base A and then storing the result of the last operation as D ≡ A^E mod N in register B.

Another object of the invention (again using any of the machines or methods described herein (infra or
supra)) includes a method for carrying out conventional multiplication of two numbers whose average
significant length is n/2 bits, comprising carrying out modular multiplication of said numbers by the
multiplication process as described by at least one of the methods described herein (infra or supra),
wherein the modulus, N, is an n-bit number consisting of all "1's" (fffffff....fff), equating J0 to 1, and
loading the multiplicand in B and manipulating A as in said multiplication process of Claim 1; N can be all
ones either by means of preloading register N with all ones or by setting the multiplexer which outputs N
to output a series of "hard" ones.

A concurrent process and a unique hardware architecture have been provided to perform modular
exponentiation without division, with the same number of operations as would be performed with a classic
multiplication/division device, wherein a classic device would perform both a multiplication and a division
on each operation.

Division is usually a non-deterministic process, and considered more difficult and time consuming than 
multiplication. 

The advantages realized in this invention result from a synchronized sequence of serial processes, which
are merged to simultaneously (in parallel) achieve three multiplication operations on n bit operands, using
two simple k bit serial/parallel multipliers in (n + 2k) effective clock cycles.

By properly synchronizing and on the fly detecting and preloading operands, the machine operates in a
deterministic fashion, wherein all multiplications and exponentiations are executed in a predetermined number
of clock cycles. Conditional branches are replaced with local detection and compensation devices, thereby
providing a basis for the simple type control mechanism, which, when refined, can consist of a series of
self-exciting cascaded counters.

The machine has particularly lean demands on volatile memory, as operands are loaded into and stored
in the machine for the total length of the operation; however, the machine exploits the CPU onto which it is
appended to execute simple loads and unloads, and sequencing of commands to the machine, whilst the
machine performs its large number calculations. The exponentiation processing time is virtually independent
of the CPU which controls it. In practice, no architectural changes are necessary when appending the machine
to any CPU. The hardware device is self-contained, and can be appended to any CPU bus.

When used with these and with previously patented and public domain process controlling protocols, the
device provides the means for accelerating the modular multiplication and exponentiation process, together
with means for precalculating the necessary constants.

The design of the preferred embodiments of the invention described herein was compacted and devised
for the specific purpose of providing a modular mathematical operator for public key cryptographic applications
on portable Smart Cards (identical in shape and size to the popular magnetic stripe credit and bank cards).
These cards are to be used in a new generation of public key cryptographic devices for controlling access to
computers, databases, and critical installations; to regulate and secure data flow in commercial, military and
domestic transactions; to decrypt scrambled pay television programs, etc.

It should be appreciated that the device may also be incorporated in computer and fax terminals, door
locks, vending machines, etc.

The hardware described carries out modular multiplication and exponentiation by applying the P operator
in a new and original proceeding. Further, the squaring can be carried out by the same method, by applying it
to a multiplicand and a multiplier that are equal. Modular exponentiation involves a succession of modular
multiplications and squarings, and therefore is carried out by a method which comprises the repeated, suitably
combined and oriented application of the aforesaid multiplication, squaring and exponentiation methods.
However, a novel and improved way of carrying out modular exponentiation will be further specified herein.

The method for carrying out modular multiplication, wherein the multiplicand, A, the multiplier, B, and the
modulus, N, comprise m characters of k bits each, the multiplicand and the multiplier not being greater than
the modulus, comprises the steps of:

1) precalculating a parameter H and at least the least significant character J0 of another parameter J, as
hereinafter defined, and loading J0 into a k bit register;

2) loading the multiplier B and the modulus N into respective registers of n bit length, wherein n = m·k;

3) setting an n-bit long register S to zero; and

4) carrying out an i-iteration m times, wherein i is from zero to m-1, each ith iteration comprising the
following operations:

a) transferring the ith character A(i-1) of the multiplicand A from Ai register means to storing means
chosen from among register and latch means;

b) generating the value X = S(i-1) + A(i-1)·B, wherein S(i-1) is the "updated" value of S, as hereinafter
defined, by:

I cycle right shifting of the B register into multiplying means,

II serially multiplying B by A(i-1),

III cycle right shifting of the modulus register N,

IV determining the "updated" value of S(i-1) as the value stored in the S register after the (i-1)th
iteration, if the same is not greater than N, or, if it is greater than N, by serially subtracting N from
it and assuming the resulting value as the "updated" value of S(i-1); and

V cycle right shifting of the register S and serially adding the value of the multiplication A(i-1)·B bit
by bit to the "updated" value of S;
c) multiplying the LS character of X, X0, by J0 and entering the value X0·J0 mod 2^k into register means
as Y0, while delaying N and X by k clock cycles;

d) calculating the value Z = X + Y0·N by:

I multiplying Y0 by N by a delayed right shifting of the N register concurrent with the aforesaid right
cycle shifting thereof, and

II adding X to the value of Y0·N;

e) ignoring the least significant character of Z and entering the remaining characters into the S register,
whereby to enter Z/2^k, except for the last iteration;

f) comparing Z/2^k to N bit by bit for the purpose of determining the updated value of S, S(i), in the manner
hereinbefore defined;

g) wherein the ith character of the multiplicand, Ai, is loaded into the A register means at any time during
the aforesaid operations;

5) at the last (m th) iteration, ignoring the least significant character of Z/2^k and entering the remaining
characters into the B register, as the value of C ≡ P(A·B)N;

6) repeating the steps 3) to 4), wherein C, or C-N if C is greater than N, is substituted for B and H is
substituted for A, whereby to calculate P = P(C·H)N; and

7) assuming the value of P obtained from the last iteration as the result of the operation A·B mod N.
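The steps above can be followed in a short functional model. This is a sketch under stated assumptions rather than a description of the serial hardware: it works on whole integers instead of bit streams, the function names are hypothetical, J0 is taken as -N^(-1) mod 2^k, and H is taken as 2^(2n) mod N, which is the value that makes step 6 return A·B mod N.

```python
def montgomery_p(a: int, b: int, n_mod: int, k: int, m: int) -> int:
    """Return P(A·B)N = A·B·2^(-m·k) mod N, taking A one k bit character at a time."""
    mask = (1 << k) - 1
    j0 = (-pow(n_mod, -1, 1 << k)) & mask     # assumed J0 = -N^(-1) mod 2^k, N odd
    s = 0                                     # step 3: S register set to zero
    for i in range(m):
        a_char = (a >> (k * i)) & mask        # character of A used in this iteration
        if s >= n_mod:                        # "updated" S: at most one subtraction
            s -= n_mod
        x = s + a_char * b                    # X = S + (character of A)·B
        y0 = ((x & mask) * j0) & mask         # Y0 = X0·J0 mod 2^k
        z = x + y0 * n_mod                    # Z = X + Y0·N, its k LS bits are zero
        s = z >> k                            # keep Z / 2^k
    return s - n_mod if s >= n_mod else s


def mod_mult(a: int, b: int, n_mod: int, k: int, m: int) -> int:
    """A·B mod N by two passes: C = P(A·B)N, then P(C·H)N with H in the role of A."""
    h = pow(2, 2 * k * m, n_mod)              # assumed H = 2^(2n) mod N
    c = montgomery_p(a, b, n_mod, k, m)
    return montgomery_p(h, c, n_mod, k, m)    # C substituted for B, H for A


N = 0xF123456789ABCDEF                        # illustrative odd 64 bit modulus
A = 0x0123456789ABCDEF
B = 0x0FEDCBA987654321
assert mod_mult(A, B, N, k=16, m=4) == (A * B) % N
```

Because B is smaller than N and the "updated" S is smaller than N, Z/2^k stays below 2N, which is why a single subtraction of N is enough at each iteration.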

Also described is a method for performing the modular exponentiation of D = A^E mod N which comprises
the following steps:

1) loading the modulus number into the aforesaid register N;

2) setting the aforesaid register S to zero;

3) loading the base A to be exponentiated into the aforesaid register B;

4) storing the exponent E in a computer register;

5) shifting said exponent E left;

6) ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 7 to 9:

7) for every one of said bits, regardless of its being 0 or 1, squaring the content of register B by the
multiplication method hereinbefore set forth, wherein the successive characters of the base are loaded into
register Ai from register B;

8) if and only if the current bit of the exponent E is 1, multiplying, after performing operation 7), the
content of register B by the base A;

9) after each Montgomery square or Montgomery multiply operation, performing a Montgomery multiplication
of the result C by H, P(C·H)N; and

10) after performing steps 6-9 for all bits of E, storing the result of the last operation as D ≡ A^E mod N
in register B.

Further described is a method for performing modular exponentiation of D = A^E mod N which comprises
the steps of:

1) loading the modulus number into the aforesaid register N;

2) setting the aforesaid register S to zero;

3) loading the base A to be exponentiated into the aforesaid register B;

4) storing the exponent E in a computer register, and a precalculated parameter T in the CPU memory;

5) shifting said exponent E left;

6) ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 7 to 8:

7) for every one of said bits, regardless of its being 0 or 1, carrying out operations 4 and 5 of the
multiplication method hereinbefore set forth, wherein both the multiplicand and the multiplier are the base A,
and wherein the successive characters of the base are loaded into register Ai from register B;

8) if and only if the current bit of the exponent E is 1, carrying out, after performing operation 7),
operations 4 and 5 of the multiplication method hereinbefore set forth, wherein the multiplicand is the
content of register B and the multiplier is the base A; and

9) after performing steps 7 and 8 for all bits of E, performing an additional Montgomery multiplication of
register B by the parameter T (P(B·T)N), and then storing the result of the last operation as D ≡ A^E mod N
in register B.

Parameter T is defined as T = (2^n)^s mod N, wherein
s = 2^(q-1) + E mod 2^(q-1), as explained in detail in the parent application.
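The role of T can be illustrated with a functional model. The sketch below makes several assumptions that are not stated in this passage: q is taken to be the number of significant bits of E (so that s equals E), the P operator is modelled as A·B·2^(-n) mod N using Python's modular inverse rather than the serial circuit, and all function names are hypothetical.

```python
def mont(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Functional model of P(A·B)N = A·B·2^(-n) mod N (N odd, Python 3.8+)."""
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod


def exp_with_t(a: int, e: int, n_mod: int, n_bits: int) -> int:
    """Square-and-multiply on uncorrected P products, fixed by one multiplication by T."""
    q = e.bit_length()                            # assumed meaning of q
    s = (1 << (q - 1)) + e % (1 << (q - 1))       # s = 2^(q-1) + E mod 2^(q-1)
    t = pow(pow(2, n_bits, n_mod), s, n_mod)      # T = (2^n)^s mod N
    b = a                                         # step 3: base loaded into B
    for bit in bin(e)[3:]:                        # bits after the leading 1 bit
        b = mont(b, b, n_mod, n_bits)             # step 7: square for every bit
        if bit == "1":
            b = mont(b, a, n_mod, n_bits)         # step 8: multiply by A on a 1 bit
    return mont(b, t, n_mod, n_bits)              # step 9: P(B·T)N


assert exp_with_t(123456, 65537, 1000003, 20) == pow(123456, 65537, 1000003)
```

In this model the loop leaves A^E·2^(-n(E-1)) in B, so a single multiplication by T = 2^(nE) mod N removes the accumulated factor; this replaces the P(C·H)N correction after every operation in the previous method.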

This invention provides an even more improved method for performing modular exponentiation of D = A^E
mod N, which comprises the steps of:

1) storing the exponent E in a computer register;

2) loading the modulus number into the aforesaid register N;

3) setting the aforesaid register S to zero;

4) performing a multiplication operation of A* = P(A·H)N, wherein A is the operand to be exponentiated, and
H is a precalculated parameter as defined before;

5) loading A* into the base register B;

6) performing a squaring operation of the contents of register B;

7) shifting said exponent E left;

8) ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 9 to 10:

9) for every one of said E bits, regardless of its being 0 or 1, carrying out operations 4 and 5 of the
squaring method hereinbefore set forth, wherein both the multiplicand and the multiplier originate from the
B register, and wherein the successive characters of the Montgomery multiplier are loaded into register Ai
from register B;

10) if and only if the current bit of the exponent E is 1, carrying out, after performing operation 9,
operations 4 and 5 of the multiplication method hereinbefore set forth, wherein the multiplicand is the
content of register B and the multiplier is the base A*; and

11) after performing steps 8-10 for all bits of E, performing an additional Montgomery multiplication of
register B by the original base A and then storing the result of the last operation as D ≡ A^E mod N in
register B if the exponent is odd; if the exponent were even, perform an additional Montgomery
multiplication of D times 1:

B ≡ P(D·1)N ≡ D·I

It is seen that the exponentiation method of this invention eliminates the need for the computation
of the parameter T, hereinbefore mentioned.
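A functional model shows why T is no longer needed once the base is first converted to A* = P(A·H)N. The sketch below is one reading of the steps that is consistent with the arithmetic, not a transcription of them: it assumes that the last exponent bit is served by the final multiplication (by the original A if E is odd, by 1 if E is even) instead of a multiplication by A*, that H = 2^(2n) mod N, and that the P operator can be modelled with a modular inverse. All names are hypothetical.

```python
def mont(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Functional model of P(A·B)N = A·B·2^(-n) mod N (N odd, Python 3.8+)."""
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod


def exp_without_t(a: int, e: int, n_mod: int, n_bits: int) -> int:
    """Exponentiation with the base held as A* = A·2^n mod N; no parameter T."""
    h = pow(2, 2 * n_bits, n_mod)                 # assumed H = 2^(2n) mod N
    a_star = mont(a, h, n_mod, n_bits)            # A* = P(A·H)N
    b = a_star                                    # A* loaded into register B
    bits = bin(e)[3:]                             # bits after the leading 1 bit
    for pos, bit in enumerate(bits):
        b = mont(b, b, n_mod, n_bits)             # square for every bit
        if pos == len(bits) - 1:
            # Assumed handling of the last bit: multiply by A (odd E) or by 1 (even E),
            # which also removes the remaining factor 2^n.
            b = mont(b, a if bit == "1" else 1, n_mod, n_bits)
        elif bit == "1":
            b = mont(b, a_star, n_mod, n_bits)    # multiply by A* on a 1 bit
    if not bits:                                  # E == 1: only the conversion is undone
        b = mont(b, 1, n_mod, n_bits)
    return b


assert exp_without_t(123456, 65537, 1000003, 20) == pow(123456, 65537, 1000003)
```

Throughout the loop B holds (partial power)·2^n mod N, so every square and multiply stays in the same representation and only the final multiplication leaves it.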

It has further been found, and this is another object of the present invention, that the machine described
(in a 512 bit register size form) permits obtaining the result of the conventional multiplication of two n/2 bit
numbers (actually any two operands which, when multiplied, will not cause a result longer than n bits, i.e. an
overflow) without using the additional hardware or the cumbersome operations that would be required to
obtain it according to the prior art. This is achieved by carrying out modular multiplication of said numbers
by the multiplication process, wherein the value of the modulus, N, is an n bit number consisting of all "1's"
(fffffff....fff), equating J0 to 1, and loading the multiplicand in B and manipulating A as in said
multiplication process.
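The effect of an all-ones modulus can be checked numerically. In the sketch below, which uses hypothetical names and an integer model of the iteration rather than the serial circuit, N = 2^n - 1 makes 2^n congruent to 1, so P(A·B)N equals A·B mod N, and that is the plain product whenever the product is smaller than N (an assumption enforced in the code).

```python
def montgomery_p(a: int, b: int, n_mod: int, k: int, m: int, j0: int) -> int:
    """Integer model of P(A·B)N with an explicitly supplied J0."""
    mask = (1 << k) - 1
    s = 0
    for i in range(m):
        if s >= n_mod:                            # "updated" S
            s -= n_mod
        x = s + ((a >> (k * i)) & mask) * b       # X = S + (character of A)·B
        y0 = ((x & mask) * j0) & mask             # Y0 = X0·J0 mod 2^k
        s = (x + y0 * n_mod) >> k                 # Z / 2^k
    return s - n_mod if s >= n_mod else s


def plain_multiply(a: int, b: int, k: int = 16, m: int = 4) -> int:
    """Conventional product through the modular process with N = all ones, J0 = 1."""
    n_mod = (1 << (k * m)) - 1                    # n bit modulus ff...ff
    if a * b >= n_mod:
        raise ValueError("product must be smaller than the all-ones modulus")
    return montgomery_p(a, b, n_mod, k, m, j0=1)


assert plain_multiply(0xFFFFFFFF, 0xFFFFFFFE) == 0xFFFFFFFF * 0xFFFFFFFE
```

The operands need not be of equal length; a 48 bit by 16 bit product works equally well in this 64 bit model as long as the result does not overflow.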

The device for carrying out such multiplication in the normal field of numbers by the aforesaid method
can be the same device, which comprises control means including a CPU and a multiplication circuit which
comprises:

an n bit shift register B for the multiplier;

an n bit shift register N for the modulus;

an n bit shift register for the value S as herein defined;

a k bit register Ai for the multiplicand;

k bit register means for the values J0 and Y0 as herein defined;

multiplier means for multiplying the content of the B register by that of the Ai register;

additional n-bit multiplier means; and adding, subtracting, multiplexing and delay means.

Preferably, all connections between the n bit registers and the remaining components are 1 bit serial
connections.

The invention further anticipates a method for performing modular exponentiation of D = A^E mod N,
substantially as described.

Finally the invention even further anticipates a method for carrying out conventional multiplication of
n/2-bit numbers, also substantially as described.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings: 

Fig. 1 is a block diagram of an apparatus according to an embodiment of the invention;

Fig. 2 is a block diagram of a modular multiplication circuit according to an embodiment of the invention;

Fig. 3 shows the particular modular multiplication circuit according to an embodiment of the invention;

Fig. 4 is a schematic diagram illustrating the timing relationship between the various operations of an
iteration of the multiplication operation according to an embodiment of the invention;

Fig. 5 illustrates a serial/parallel multiplier cell;

Fig. 6 illustrates an 8 bit serial/parallel multiplier;

Fig. 7 illustrates a serial adder;

Fig. 8 illustrates a serial subtracter;

Fig. 9 illustrates an architecture for calculating the H parameter.

More specifically, the figures depict several layers of logical concepts necessary for understanding the
device in its totality. In all cases, the clock signal motivates the circuit, and if there is a reset signal,
its purpose is to initialize a circuit to a zero state.

Detailed Description of Preferred Embodiments 


Figure 1 is a block diagram of the monolithic circuit into which the invention is integrated. The MULT block
contains the hardware device which is the basis for the invention; the State machine contains the controller
which drives the MULT circuit; the ROM block contains all the non-volatile memory (ROM and EEPROM),
wherein reside the program for controlling the Smart Card, the trusted third party public keys, and the
program for driving the MULT block and the State machine; the RAM block contains the volatile memory which
stores temporary operands, such as messages to be exponentiated, public keys to be authenticated, data in
transit to the MULT block, etc.; the CPU (central processing unit) can be virtually any popular microcontroller
which has an 8 bit or wider internal bus.

Fig. 2 shows in block diagram form a modular multiplication circuit according to the invention, which can
be used for carrying out modular squaring and modular exponentiation. Numerals 10, 11 and 12 indicate three
registers that are n bits long (n = k·m), which constitute the B, S and N registers respectively, into which
the multiplier, the value S and the modulus are loaded. The aforesaid registers are preferably divided into
two n/2 bit registers, preferably including a k least significant bit subdivision for the N and B registers.
Multiplexers 13, 14 and 15 respectively are placed before the said registers, and if they are subdivided into
component parts, a multiplexer is placed before each subdivision. As shown in the block diagram, these
registers are intended to be serially loaded, but it would also be possible to load them in parallel. Numerals
16, 17 and 18 designate three registers, each of which is k bits long, for receiving the values Ai, J0 and Y0
respectively. Registers 16 and 17 are serial load-parallel output or serial and parallel load-parallel output
shift registers. Register 18 is preferably a serial in, parallel output shift register. The content of these
registers is intended to be processed by multiplying means 19 and 20 through components 21 and 22, which are
preferably k bit latches. If they are latches, they are loaded from registers 16, 17 and 18 through k bit
buses. If they are registers, they can be serially loaded through 1 bit connections. Numerals 24, 25, 25', 26,
36, 37 and 38 also designate multiplexers. Multipliers 19 and 20 may be multiplier means with a serial input A,
a parallel input B and a serial output, or any other serial/parallel input, serial output multiplying means.
Multiplexer 38 can force the modulus N to be all "1"s for multiplying in the normal field of numbers.

Numerals 27, 28, 29, 30, and 31 designate 1 bit full/half adder/subtract means. 31 designates a full
adder/subtract means. 32, 33 and 34 designate k bit, k clock cycle delay means capable of delaying digital
signals, which may be composed of analog or digital components, though digital components are preferred. 35 is
a Borrow detector, which is a two bit latch/storage means. As is seen, the device according to the invention,
although it is intended to handle large numbers such as 512 bit numbers, does not comprise buses, except
optionally a few k bit buses, and this constitutes an important saving of hardware. When registers B, S and N
comprise n/2 bit parts, the device of the invention can be used to carry out multiplication and exponentiation
operations on 256 bit numbers, which is a substantial advantage as to the flexibility of the use of the device.

Fig. 3 shows the logic cells according to one preferred embodiment of the invention. Operands are fed
into the Ai latch, the J0 register, the B register and the N register via serial connect DI, and results are
unloaded via serial connect DO, from the B or S register.

Signal X is the bit stream summation of S and the product of B and Ai. (Values after S and B have assumed
values smaller than N.) Signal Y0 is the k LS bit stream of the product of J0 and X. Signal Z is the summation
of X and the product of Y0 and N. The k LS bits of Z, being all zeros, are disregarded, and only the n MS bits
are serially fed into S or B.

The Borrow_detector is a logic circuit which detects whether the value of Z/2^k is, or is not, larger than N.

The subtracters Sub1 and Sub2 subtract the bit stream N from the bit streams of B and S, whenever B
or S is larger than N.

Ad1 and Ad2 summate bit streams to produce X and Z streams. 

The Delay1 and Delay2 shift registers are necessary to provide storage for synchronizing the
mathematical processes.

No clocking controls are included in the drawing. It is assumed that clocks are supplied by the state
machine whenever data must either emanate from or be fed into any of the above mentioned serially
loaded/unloaded logic circuits.

Other controls are also not specified, i.e., multiplexer addresses, latch transfer signals, etc., which should
be obvious to those acquainted with the art from the explanatory material included in this specification.

It will be evident to skilled persons how the device of Fig. 2 or Fig. 3 carries out the operations which
constitute the multiplication method according to the invention. The timing relationship of said operations is,
however, further illustrated in Fig. 4. Said figure diagrammatically illustrates all the various operations
carried out in effective successive clock cycles in an embodiment of the invention, in which n = 512, k = 32
and m = 16. This is a fairly common situation in the encryption art. When the invention is carried out
according to the embodiment illustrated in Fig. 3, the same device can be used to operate with n = 256, as well.

In Fig. 4 a succession of the various operations is illustrated as a function of the effective clock cycles,
which are marked on the abscissa axis. At the beginning of the operation and before any of the iterations which
form a part of the modular multiplication method according to the invention, the values of B, N and S are loaded
in the respective registers. The first character of A is also loaded into the respective register. As soon as an
iteration begins and during k clock cycles, the shifting of the content of the B and S registers is carried out.
The generation of the X value takes place during n+k effective clock cycles, the first k clock cycles being
occupied by entering the value of X0. During the first effective k clock cycles the value of Y0 has been entered.
During the next effective n+k clock cycles, the value of X, which had been introduced into multiplier 20, is now
shifted or introduced into adder 31 after having been delayed by delay 34. The value of N is used at three
different time phases: first, to "update" S and B; second, delayed k effective clock cycles, to multiply by Y0;
and then, delayed a second k effective clock cycles, to sense how the next value of S or B will be "updated".
During the same n+k effective clock cycles, Z is calculated, as well as Z/2^k. The value of Ai is loaded beginning
with the first k effective clock cycles and continuing during the successive part of the iteration. The final
value of Z/2^k is entered into register S (or B) during n clock cycles after the first 2k effective clock cycles.

Fig. 5 shows an implementation of a serial/parallel multiplier cell (as an aid to those technical people who
are familiar with the art, but who may not be aware of the workings of such a configuration). Each of these
cells comprises an MPL block as shown in Fig. 6.

Fig. 6 shows an implementation of an 8 bit serial/parallel multiplier. It implements Booth's multiplication
algorithm for unsigned serial/parallel multiplication. In the ML1 and ML2 blocks of figure 3, the s/p multipliers
are k bits long. Note that the MS cell is degenerate. The parallel 8 bit multiplicand is input on the XI
connections and the n bit long serial multiplier is input on the Y connector (LS bit first, and a string of k
zeros after the MS bit of the multiplier). The product is output on MO, LS bit first, MS bit last, wherein a full
product is n + k bits long.
long. 

Fig. 7 shows the serial adder for summating two bit streams which appear on the A and B input connections,
and which outputs the summated stream on connection S. The LS bits are first to be input, and the output
stream, for operands m bits long, is m+1 bits long. At the end of the m'th effective clock, the CI output is
the (m+1)'th bit of the number string.
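The behaviour of such a serial adder can be modelled bit by bit. The following is a minimal sketch with hypothetical helper names; it models one bit per clock with a stored carry and does not reproduce the gate layout of Fig. 7.

```python
from typing import List


def to_bits(value: int, m: int) -> List[int]:
    """Return the m least significant bits of value, LS bit first."""
    return [(value >> i) & 1 for i in range(m)]


def from_bits(bits: List[int]) -> int:
    """Rebuild an integer from a bit list given LS bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits: List[int], b_bits: List[int]) -> List[int]:
    """Add two equal-length bit streams; the stored carry becomes the (m+1)'th bit."""
    carry = 0                                    # carry flip-flop, reset to 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                # modulo 2 sum for this clock
        carry = (a & b) | (a & carry) | (b & carry)
    out.append(carry)                            # carry left after the m'th clock
    return out


assert from_bits(serial_add(to_bits(13, 4), to_bits(11, 4))) == 24
```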

Fig. 8 shows the serial subtracter for emitting the difference between two bit streams which appear on
the A and B input connections, and which outputs the difference stream on the D connection. The LS bits are
first to be input, and the output stream, for operands m bits long, is m bits long. At the end of the m'th bit,
the BI output is the (m + 1)'th bit of the number string and serves as a borrow out indication.
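A bit-level model of the serial subtracter shows how the final borrow indicates which operand is larger. This is a minimal sketch with hypothetical names; it models one bit per clock with a stored borrow and is not the gate implementation of Fig. 8.

```python
from typing import List, Tuple


def to_bits(value: int, m: int) -> List[int]:
    """Return the m least significant bits of value, LS bit first."""
    return [(value >> i) & 1 for i in range(m)]


def from_bits(bits: List[int]) -> int:
    """Rebuild an integer from a bit list given LS bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_sub(a_bits: List[int], b_bits: List[int]) -> Tuple[List[int], int]:
    """Return the m bit difference stream A - B and the final borrow."""
    borrow = 0                                   # borrow flip-flop, reset to 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ borrow)               # difference bit for this clock
        not_a = 1 - a
        borrow = (not_a & b) | (not_a & borrow) | (b & borrow)
    return out, borrow


diff, borrow_out = serial_sub(to_bits(5, 4), to_bits(9, 4))
assert from_bits(diff) == (5 - 9) % 16 and borrow_out == 1   # borrow set: A < B
```

A final borrow of 1 means the subtrahend was larger, which is the information a borrow detector needs in order to decide whether N should be subtracted.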

Fig. 9 shows the hardware layout for calculating the H parameter for a modulus N, which is n bits long. During
this mode of operation, for an n bit long modulus, the N register is rotated n + 1 times, synchronized to the
rotate of the S register, which rotates through Sub1 with a delay of the LS bit (an LS zero is inserted at the
first clock cycle in M2_1;1). The borrow detector "knows" at the end of the complete rotation whether or not N
will be subtracted from the S stream on the next round, and switches the previous subtract multiplexer
accordingly for the next round.
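The shift-and-subtract principle behind this mode can be sketched functionally. The code below assumes H = 2^(2n) mod N (the value for which P(C·H)N undoes the factor left by P(A·B)N) and simply doubles a running value 2n times, subtracting N whenever the doubled value is not smaller than N. It does not model the exact rotation count or multiplexer switching of Fig. 9, and the function name is hypothetical.

```python
def h_parameter(n_mod: int, n_bits: int) -> int:
    """Compute 2^(2n) mod N by repeated doubling with a conditional subtraction."""
    s = 1 % n_mod
    for _ in range(2 * n_bits):
        s <<= 1                       # inserting an LS zero doubles the S stream
        if s >= n_mod:                # decision taken by the borrow detector
            s -= n_mod                # one serial subtraction of N
    return s


assert h_parameter(1000003, 20) == pow(2, 40, 1000003)
```

Since the running value is always smaller than N before doubling, the doubled value is smaller than 2N and a single subtraction per round is sufficient.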

As stated above, Fig. 1 illustrates in block diagram form a device for carrying out the methods according
to the invention. Block CONTROL of the device includes:

1) A complete Central Processing Unit (CPU)

2) Counters

3) A State Machine.

The CPU contains volatile and non-volatile memory, some of which can be utilized by this multiplication
process. The CPU controls the modular arithmetic block in the circuit.

The CPU:

1) Communicates with a host.

2) Loads and unloads data to and from the chip.

3) Commands the circuit to perform a sequence of mathematical operations.

4) Is responsible for other cryptographic, noncryptographic and data processing operations.

The counters generate the address for the embodied State Machine.

The State Machine decodes the addresses and generates control signals to the MULT block. These control
signals command the MULT block to perform the proper sequence of operations necessary to calculate the
P(A·B)N transformation (where A can be equal to B).

Fig. 3 is a hardware block diagram of the hardware device that embodies the physical aspects of the
invention (MULT), and is intended to aid in focusing onto several of the architectural concepts to be protected
by this patent. The block concurrently implements the sequence specified in equations (1) to (5), and also,
without changing the synchronous clocking, the transformations of S and B from limited congruence to equality.
In this section we assume that the constants (functions of N), J0 and H, have been precalculated.


The circuit performs P(A·B)N. Using this function the circuit can be utilized to calculate:

1) B·A mod N

and

2) B^2 mod N,

wherein B must always be smaller than N.

Implementing C = B·A mod N (A can be equal to B):

1) The processor preloads the operand, B, into the B register, and the operand, N, into the N register.

2) Each time the circuit in MULT starts calculating the next value of S, the circuit signals (flags) the
CPU to preload the next Ai. After the S(m)'th iteration, a number which has ≡ congruence to B resides
in the B register.

3) Block MULT calculates F = P(B·H)N, where H is a precalculated constant, in a sequence identical to steps
1) and 2), except that the processor will now preload the sequence of H characters (using the same
sequence as used when it previously loaded the Ai characters).

Implementing C = B^2 mod N:

1) Assuming that register B contains a value which is known to be ≡ congruent to B, and the register N
contains the modulus N (as is generally the case when squaring), the MULT block can now proceed to
squaring by first preloading the Ai register with B0, the LS character of B.

2) The calculation B = P(B·B)N proceeds like the second step in the multiplication operation, except that
the subsequent loading of the Bi characters is done serially "on the fly" from the B register, as the B
register rotates.

3) Calculating P(B·H)N, if necessary, is identical to the previous step 3.

As will be apparent to the skilled person, the inventors do not claim that the serial/parallel (s/p) multipliers
or any of the conventional components used form a part of the invention per se. The following is included to
clarify the use of standard logic cells in the public domain, as several of them are not commonly used. The
gate implementation shown here is for demonstration only. Skilled technicians optimize these logic cells.

The operands A, B and N are each n bit long, made of m groups of k bit long characters, therefore n = k m. 
In a hardware implementation where k = 32; m can be either 8 or 16 binary bits long. 

ML1, ML2

These multipliers execute Booth's algorithm for unsigned multiplication, wherein the parallel operand
is k cells (bits) long and the serially loaded operand can be of any required length.
Each serial/parallel multiplier is made of k-1 MPL cells (figure 5). The most significant cell, its MS bit,
consists of an AND gate only.

Each MPL cell multiplies the serial input Y with its parallel XI input bit and summates this result with the
serial output of the preceding MPL unit and its own previous cycle's carry out bit.

The MPL cell is a 2 bit multiplier adder. The block multiplies the input bit XI and the serial input bit Y and
summates the result with DI (Data In) and the carry CI (Carry In) from the previous cycle. The final result is
DO (Data Out) and a CO (Carry Out) for the next cycle. This carry out is stored in a Data Flip-Flop (D F-F).
DO = (DI + CI + XI·Y) mod 2,

and the saved carry CO will be the CI on the next cycle. This carry is the Boolean sum:

CO = CI·XI·Y + CI·DI + DI·Y·XI.
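The two MPL cell equations can be checked with a small bit-level model. The Python sketch below is illustrative only: the function names, the registered data path between neighbouring cells, and the LSB-first ordering of the serial operand are assumptions made for the simulation and are not taken from figure 5.

```python
def mpl_cell(di, ci, xi, y):
    """One MPL cell: returns (DO, CO) for single-bit inputs DI, CI, XI, Y."""
    p = xi & y                               # one-bit product XI*Y
    do = di ^ ci ^ p                         # DO = (DI + CI + XI*Y) mod 2
    co = (ci & p) | (ci & di) | (di & p)     # CO = CI*XI*Y + CI*DI + DI*Y*XI
    return do, co


def serial_parallel_multiply(x, y, k, y_bits):
    """Multiply a k-bit parallel operand x by a serial operand y (LSB first).

    Assumption: each cell's DO is registered and becomes the DI of the next
    less significant cell on the following clock; the MS cell sees DI = 0,
    so it degenerates to the AND gate mentioned in the text.
    """
    xs = [(x >> j) & 1 for j in range(k)]
    data = [0] * (k + 1)     # data[j + 1] is the registered DI of cell j
    carry = [0] * k          # one carry flip-flop per cell, reset to 0
    product = 0
    for t in range(y_bits + k):              # k extra clocks flush the cells
        yt = (y >> t) & 1 if t < y_bits else 0
        outs = [0] * k
        for j in range(k):
            outs[j], carry[j] = mpl_cell(data[j + 1], carry[j], xs[j], yt)
        product |= outs[0] << t              # cell 0 emits one product bit
        for j in range(1, k):
            data[j] = outs[j]
    return product
```

In this model the serial operand can be of any length, as the text states; only the parallel operand is bounded by the number of cells.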

Ad1, Ad2

This is a simple 1 bit full adder with a D F-F, for saving the carry to be carried in at the next clock cycle
(figure 7).

The two inputs A and B are summated with the carry CI from the previous cycle to generate the modulo
2 sum, which is saved in the D F-F for the output signal, S. Upon reset the carry bit is set to "0".

Sub1, Sub2, Sub3

Each of the blocks, described in figure 8, is a full subtracter with a storage D F-F for the previous borrow.
This block is similar to the Ad1 block, with the exception that it serially subtracts the B stream from the A
stream.
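The serial adder and subtracter can be modelled in a few lines. This is a minimal sketch, assuming LSB-first bit streams and a single carry or borrow flip-flop that is reset to "0"; the function names are hypothetical and the gate-level details of figures 7 and 8 are not reproduced.

```python
def serial_add(a, b, n_bits):
    """Add two LSB-first bit streams; returns (sum mod 2**n_bits, final carry)."""
    carry = 0                                 # carry D F-F, reset to "0"
    total = 0
    for t in range(n_bits):
        abit, bbit = (a >> t) & 1, (b >> t) & 1
        total |= (abit ^ bbit ^ carry) << t
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return total, carry


def serial_sub(a, b, n_bits):
    """Subtract stream b from stream a; returns (difference mod 2**n_bits, final borrow)."""
    borrow = 0                                # borrow D F-F, reset to "0"
    diff = 0
    for t in range(n_bits):
        abit, bbit = (a >> t) & 1, (b >> t) & 1
        diff |= (abit ^ bbit ^ borrow) << t
        borrow = ((1 - abit) & bbit) | ((1 - abit) & borrow) | (bbit & borrow)
    return diff, borrow
```

The final borrow is "1" exactly when a is smaller than b, which is the property the borrow detectors rely on when comparing a stream with N.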

Delay1, Delay2, Delay3

These are k bit shift registers consisting of k concatenated 1 bit memory devices. They are used to
synchronize the various operands in the mathematical sequence. This will become obvious as the circuit is
explained.

Ai, J0, Y0

These blocks are k bit long serial-in/parallel-out shift registers. k input bits enter serially. After k effective
clock cycles, these k bits appear in parallel on the output.

In figure 2 the thin lines are serial one bit conductors, and the bold lines denote k bit parallel conductors. 

M4_1;x, M3_1;x, M2_1;x

These are one bit output multiplexers: M4_1;x, which outputs 1 of 4 inputs; M3_1;x, which outputs 1 of 3
inputs; and M2_1;x, which outputs 1 of 2 inputs. x denotes the explicit index of a specific component.

B(0:k-1), B(k:n1-1), B(n1:n2), S(0:n1-1), S(n1:n2), N(0:k-1), N(k:n1-1), N(n1:n2)

These are shift registers. The size and place in the sequence of a longer register is designated by the
numbers in the brackets, e.g., X(s:t) is a t - s + 1 bit long shift register; s is the index of the first bit of X(s:t),
and t is the index of the last bit of the X(s:t) register. For example, B(0:511) is composed of the three shorter
cascaded registers: B(0:31), B(32:255) and B(256:511).
n1 is generally equal to n/2, e.g., 256. n1 must be a multiple of k.

n2 is equal to n-1.

k is the length of the machine character, i.e. the size of the serial/parallel multipliers.
Therefore, in the first implementation the following values are anticipated: n1 = 256, n2 = 511, n = 512
and k = 32.

Latch1, Latch2

These two latches are k bit registers. They are used to lock the parallel data into the multiplier to enable
single clock parallel transitions in the multiplication sequences.

MULT block operation - 𝒫 field multiplications and exponentiations.

For ease of explanation we have chosen to designate only those clock cycles which actually move data
in registers; we define these "moving" cycles as "effective clock cycles".

𝒫(A·B)N Multiplication

Stage 1: Initial loading
The following registers are loaded through DI.

1) J0 into the J0 register (precalculated by the CPU).

2) B into the B register.

3) N into the N register.

4) The first character of A, A0, into the Ai register.

Simultaneous to step 2, register S is loaded with zeros.

After loading these five registers, the two serial/parallel unsigned multipliers ML1 and ML2, the serial adders
Ad1 and Ad2, and the serial subtracters Sub1, Sub2 and Sub3 are reset.
Stage 2: Executing the B·A0 iteration.

The data, A0, loaded into register Ai is transferred into Latch1. Register B is cyclically shifted to the right. At
the initiation of a process the Borrow2 control signal is "0"; therefore, the content of B simply passes
unchanged through the subtracter Sub1 and is multiplied by A0 in ML1. Register B's output is fed back,
unchanged, into the register's input.

The result of this multiplication is serially added in Ad1 to the content of register S, which is all zeros on
this first iteration. This operation generates X as earlier described herein.
While these processes are progressing, the CPU preloads the next character of A, A1, into the Ai preload
register.

J0 from the J0 register is loaded into Latch2. X is serially input to ML2 to be multiplied by J0. Thus after k
effective clocks, the content of the register Y0 is the k least significant bits of the product X0·J0.
Then, after these first k effective clocks, ML2 is reset; the serial input multiplexer M3_1;4 is switched from
the X stream to the N stream; the data in register Y0 is parallel-loaded into Latch2 in place of J0; and the output
is switched to the Y0·N stream. For the next n+k effective clock cycles the serial output result of the ML2
multiplication will be Y0·N. X, which was delayed by k effective clocks, is now summated in Ad2 to the product
stream of ML2; this generates Z = X + Y0·N, a number wherein the k least significant bits are zeros.

The first k bits from Ad2, being all zeros, are disregarded and the next n bits are serially returned to the S
register. This final quantity may be larger than or equal to N (in which case it must be reduced by N); i.e.,
S(1) ¥ S(1) mod N.

To detect if S ≥ N, N is serially subtracted from this n bit long (Z/2^k) stream in Sub3. However, only the
n'th Borrow bit is stored in the borrow-save flip-flop.

If this Borrow bit is "0" or the final carry bit CO of adder Ad2 is "1", then the new value in S is larger than
or equal to N.

At the end of this first iteration, there is a value in the S register which is the ¥ limited congruence of S(1)
mod N; registers J0, B, and N retain the original values with which they were loaded; and the preload register,
Ai, contains A1.
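The arithmetic of a single B·Ai iteration can be followed with ordinary integers. The sketch below is a hedged illustration rather than a model of the circuit timing: it assumes that J0 satisfies N0·J0 ≡ -1 mod 2^k (which is what makes the k LS bits of Z zero), and the function names and the small 8 bit operands are chosen only for the example.

```python
def j0_for(n, k):
    """Assumed definition: J0 = -(N^-1) mod 2^k, so that N0*J0 = -1 mod 2^k."""
    return (-pow(n, -1, 1 << k)) % (1 << k)    # requires odd n, Python 3.8+


def montgomery_iteration(s, a_i, b, n, j0, k):
    """One iteration: returns (new S, flag telling whether new S >= N)."""
    mask = (1 << k) - 1
    x = s + a_i * b                  # ML1 product added to S in Ad1
    y0 = ((x & mask) * j0) & mask    # ML2, first k effective clocks
    z = x + y0 * n                   # ML2 (Y0*N stream) added to delayed X in Ad2
    assert z & mask == 0             # the k LS bits of Z are zeros
    s_next = z >> k                  # the first k bits are disregarded
    return s_next, s_next >= n       # Sub3 borrow detect


# Example with k = 4, N = 0xB3, B = 0x65, A0 = 0x7 and S = 0
k, n, b = 4, 0xB3, 0x65
s1, subtract_next = montgomery_iteration(0, 0x7, b, n, j0_for(n, k), k)
```

With these example values S(1) comes out larger than N, so the flag requests a subtraction of N on the next iteration, which is the limited congruence case described above.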

Stage 3: Subsequent B·Ai iterations.
The next character of A, A1, is transferred into Latch1, the parallel input of ML1.

During the next and subsequent B·Ai iterations, at the end of each iteration, the content of S is ¥ to S(i)
mod N. If S(i) ≥ N, then N is to be subtracted from S(i) in Sub2.

As each iteration commences, the next character of A is loaded by the CPU into the preload register, Ai.
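Taken together, the m iterations and the subsequent retrieval with H can be sketched as follows. This is an arithmetic illustration under stated assumptions, not the authors' hardware sequence: J0 is taken as -(N^-1) mod 2^k, H is taken as 2^(2n) mod N, and the function names are hypothetical.

```python
def montgomery_multiply(a, b, n, k, m):
    """Returns a value with limited congruence to A*B*2^(-k*m) mod N (b < N assumed)."""
    mask = (1 << k) - 1
    j0 = (-pow(n, -1, 1 << k)) & mask      # assumed definition of J0 (n odd)
    s = 0
    for i in range(m):
        if s >= n:                         # Sub2: reduce the previous S(i)
            s -= n
        a_i = (a >> (k * i)) & mask        # next character of A
        x = s + a_i * b
        y0 = ((x & mask) * j0) & mask
        s = (x + y0 * n) >> k              # Z / 2^k
    return s                               # either the residue or residue + N


def modmul(a, b, n, k, m):
    """A*B mod N by two passes: C = P(A*B), then P(H*C) with H = 2^(2*k*m) mod N."""
    h = pow(2, 2 * k * m, n)               # assumed definition of H
    c = montgomery_multiply(a, b, n, k, m)
    if c >= n:                             # rectify the limited congruence
        c -= n
    p = montgomery_multiply(h, c, n, k, m)
    return p - n if p >= n else p
```

For example, `modmul(0x47, 0x65, 0xB3, 4, 2)` agrees with `(0x47 * 0x65) % 0xB3`; the second pass only removes the 2^(-n) factor left by the first.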



𝒫(B·B)N Squaring operations.

The first operation in a normal exponentiation is a squaring operation, performed like a normal
multiplication with the multiplier A loaded into the B register, and the multiplicand loaded into the Ai register
in k bit increments as described in the previous section. Subsequent squarings are performed on operands
(multiplier and multiplicand) whose limited congruence resides in the B register.

During such 𝒫(B·B)N squarings, from the outset the J0, S, B, and N registers are already loaded from a
previous multiplication or squaring, and remain unchanged; however, at each iteration the Ai register must be
loaded with a new character, derived from a k bit character which resides in the B register.

For these subsequent squarings, the Ai register is preloaded from the B stream "on the fly". Once the CPU
has given the command to square, it has no task to perform during the subsequent B·Bi squaring operations.
The Bi's which are loaded are segments of B which have flowed through Sub1 (Bi segments of B's which are
already smaller than N).
Stage 1: B·B0 iteration

Initially, the last ¥ of S from the previous calculation resides in the B register.

The k LS bits of registers B and N are cyclically shifted to the right; thus after k effective clocks, the B
and N registers are restored to their original states. The value in the B register is either the proper B value or
the B-N value to be used for the next 𝒫 multiplication. So, for the first round, the Ai register is to be preloaded
with either B0, which resides in the B register, or the k LS bits of B-N.
The purpose of this first k bit rotate is to be able to stream through Sub1 the first k bits of preload for register
Ai. Immediately after being serially loaded, Ai is unloaded into Latch1, and the Ai preload register is free to be
loaded with B1, the second character of B.

During this and subsequent operations, as the Borrow2 signal is set or reset, the output string from Sub1
is positive and always smaller than N.
Now as all values are loaded into the registers, this first multiplication proceeds similarly to the B·A0
iteration, as described in the previous section, except that as B rotates, as will be explained, Bi is loaded into
the Ai register (remember that in a multiplication the CPU loads the Ai register).

As the second k bit character, B1, emanates from the B stream, during this first B·B0 process, the B1
segment is serially switched into the Ai preload register "on the fly" in preparation for the next squaring
operation, i.e., the B·B1 iteration.

Stage 2: B·B1 iteration.

The value loaded into the Ai register, B1, is transferred to its output Latch1. During the next n + 2k (e.g.,
n + 64) effective clock cycles, the multiplication process on B·B1 is performed as described above.

As before, the signals Borrow1 and Borrow2 determine if N is to be subtracted from the streams
emanating from the B and S registers. If the number in the S register is larger than or equal to N then Borrow1 is
set and, with subtracter Sub2, N is subtracted from S. N is subtracted from B, if necessary, for the duration
of a complete m iteration multiplication loop. Such a condition would have been sensed with Borrow2 at the
end of the previous multiplication or square.

The two flip-flops, Borrow1 and Borrow2, contain the final values of the conditioned Borrow Out from Sub3.
Borrow1 is set or reset after each iteration of S. Borrow2 is set or reset after the last S(m) iteration, whence
B is loaded with S(m). The conditioned Borrow Out is the signal which indicates whether an S(i) is larger than
N.

During the B·B1 sequence, the B2 character is loaded "on the fly" into the Ai preload register as the B2
character exits the Sub1 subtracter.
Stage 3: Subsequent B·Bi multiplication iterations

The remaining m-2 iterations are performed; during each one, the Ai register is loaded with the value of the
Bi character as it exits Sub1, in preparation for the next loop.

The final result, a limited congruence, resides in both the S and B registers. This data will be rectified at
Sub1, if necessary, as it is serially outputted through DO.

Operation of MULT block - Calculating the H parameter

To calculate H, the machine is reconfigured to use registers S and N as in Figure 9. We demonstrate the
operation of the operator, using the numerical example already employed above. This configuration performs
an H calculation in n+1 rounds. On each round both S and N are rotated, each rotation being n effective clocks.
On each round N circulates and returns unchanged. At the end of the i'th round, S and the "Next subtract"
signal contain the equivalent of a limited ¥ congruence of S(i).

The initial conditions - 1st Round

At the outset of the first round the module N is loaded into the N register and the borrow detect flag is
reset, signifying that the first trial subtraction will be successful; the output flip flop of Sub1 is reset to zero.
For round 1 we know that the MS (n'th) bit of the trial dividend is one. This bit is stored by inference in the
"Next subtract" flip flop (no space in S). The "Next subtract" commands the S-N subtract in round 1.

We demonstrate, using the n=4 bit numerical example described above.

H Calculating Mode - Initial Condition

N = 1011₂, n = 4.

Stored in the Borrow Detector's Next Subtract Flag: 0

At the outset, we know that the dividend's MS bit is "1". Therefore, as we know that there could be no
borrow, we reset the Next subtract flag to zero. See Figure 7.

S(0), the contents of the S register:

(0) 0000 {0000} <- "Virtual Zeros"

(These "virtual" LS zeros are not affected by a trial subtract. At each round there will be one less zero in the
"virtual zero counter".)

The Borrow Detect-Next subtract signal is a zero, so on the first round M2_1;3 will feed N to Sub1. The Diff
will be S-N with a leading zero, or to be exact, 2·(S-N).

On the first clock cycle, the zero from the reset Sub1 output flip flop is fed into S's MS cell just as the LS
bit from S is fed into Sub1. (The LS bit of S is always a zero "drawn" from the "virtual" LS zero counter.)

During the first n-1 clock cycles the LS n-1 bits of Diff will feed into S.

N is rotated back into its MS bit cell.

The BO (Borrow out) serial stream is equal to the series of borrows that result from the Diff mod 2^n - N
stream; however, only the last borrow is sampled and may be relevant.

On the n'th effective clock cycle, "Next subtract" will raise a flag for a subtract for the next round if the MS
bit of Diff is "1", OR if BO = "0".

On the first round N is subtracted from 2^n, and n bits of the result multiplied by 2 (an LS zero insertion)
is returned to the S register, EXCEPT for the MS bit which is stored "by inference" in the Borrow Detect_Next
Subtract register.

At the end of the first round rotate:
S(1)=1010, Next subtract =1 (BO=1), and we know that on the next round there will be no subtract of S-N
in Sub1.
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H Parameter Calculation - 2nd Round 



N = 1011₂, n = 4.

Stored in the Borrow Detector's Next Subtract Flag: 1

At the outset, we know that the second round subtraction would not be successful as BO = "1" was
"detected" at Sub2.

S(1), the contents of the S register after the first round:

(1) 1010 {000} <- 3 "Virtual Zeros" left

The Borrow Detect-Next subtract signal is a one, so on this round M2_1;3 will feed zeros to Sub1.
Diff = 2·S. THERE WAS NO SUBTRACT.

(The LS bit of S is again a zero "drawn" from the "virtual" LS zero counter.)

For the subsequent n-1 clock cycles the LS n-1 bits of Diff = 2S will be fed into the S register.

N is rotated back into its MS bit cell.

As the MS bit of Diff is a "1", we know that on the next round we must subtract S-N.

The sampled BO is irrelevant.

Diff=1 0100 and S(2)=0100, Next subtract =0, and we know that on the next round there will be a subtract
of S-N in Sub1.

H Parameter Calculation - 3rd Round

N = 1011₂, n = 4.

Stored in the Borrow Detector's Next Subtract Flag: 0

At the outset, we know that the third round subtraction will be successful as the MS bit of Diff was "1".

S(2), the contents of the S register after the second round:

(0) 0100 {00} <- 2 "Virtual Zeros" left

The Borrow Detect-Next subtract signal is a zero; N will be subtracted from Diff.

For the subsequent n-1 clock cycles the LS n-1 bits of Diff = 2·(S-N) will be fed back into the S register.

As the MS bit of Diff is "1" in Sub1, in the next round we must subtract S-N.

Diff=1 0010 and S(3)=0010, Next subtract =0, and we know that on the next round there will be a subtract
of S-N in Sub1.

H Parameter Calculation - 4th Round

N = 1011₂, n = 4.

Stored in the Borrow Detector's Next Subtract Flag: 0

At the outset, we know that the fourth round subtraction will be successful as the MS bit of Diff was "1".

S(3), the contents of the S register after the third round:

(0) 0010 {0} <- 1 "Virtual Zero" left

The Borrow Detect-Next subtract signal is a zero; N will be subtracted from Diff.

As there was no borrow, BO = "0", in the next round we subtract S-N.

Diff=0 1110 and S(4)=1110, Next subtract =0, and we know that on the next round there will be a subtract
of S-N in Sub1.

H Parameter Calculation - n+1'th (5th) Round

N = 1011₂, n = 4.

Stored in the Borrow Detector's Next Subtract Flag: 0

At the outset, we know that the fifth round subtraction will be successful as there was no borrow
(BO = "0") in the fourth round.

S(4), the contents of the S register after the fourth round:

(0) 1110 { } <- No "Virtual Zeros" left; last round

The Borrow Detect-Next subtract signal is a zero; N will be subtracted from Diff.

Diff=0 0011 and S(5)=0011 is the remainder, which is the value of H.
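The five rounds of the worked example can be replayed arithmetically. The sketch below is an interpretation of the procedure as a restoring division of 2^(2n) by N, assuming N has its MS bit set; the function name and the `trace` list are introduced only for illustration, and the bit-serial timing of Figure 9 is not modelled.

```python
def h_parameter(n_mod, n_bits):
    """Compute H in n_bits + 1 rounds; returns (H, dividend after each round)."""
    assert n_mod >> (n_bits - 1) == 1, "N is assumed to have its MS bit set"
    dividend = 1 << n_bits            # MS "1" is the bit stored "by inference"
    trace = []
    for rnd in range(1, n_bits + 2):
        if dividend >= n_mod:         # "Next subtract" commands the S-N subtract
            dividend -= n_mod
        if rnd <= n_bits:             # draw one "virtual" LS zero (multiply by 2)
            dividend <<= 1
        trace.append(dividend)
    return dividend, trace


h, trace = h_parameter(0b1011, 4)
# The low 4 bits of each trace entry are the S register contents after that round.
s_register = [format(v & 0b1111, "04b") for v in trace]
```

For N = 1011₂ the S register sequence is 1010, 0100, 0010, 1110, 0011, matching S(1) to S(5) above, and the final value 0011 equals 2^8 mod 11.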
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Claims 



1) A compact synchronous microelectronic peripheral machine for standard microprocessors with means
for proper clocking and control, having as essential elements: three main subdivided, switched and clocked
shift registers, B (13), S (14) and N (15); two only multiplexed serial/parallel multipliers; borrow detectors (35),
ancillary subtracters and adders (27,28,29,30 & 31); delay registers and switching elements; all of which
embody a totally integrated concurrent and synchronous process approach to modular multiplication, squaring,
and exponentiation.

2) A machine according to claim 1 which implements a unique, not straightforward, synchronized hardware 
derivation of the Montgomery method designed for hardware modular multiplication, squaring, and exponen- 
tiation. 

3) A machine according to claim 1 which executes a derivation of the Montgomery method as a multiplicity
of simultaneous serial processes, i.e., multiplications, subtractions, additions, stored delays, and a division
by 2^k; wherein the processes are executed in parallel, as serial processes merge.

4) A machine according to claim 1 which executes a derivation of the Montgomery method as a multiplicity 
of serial processes for modular multiplication, squaring, and exponentiation, precluding the use of wide internal 
busses. 

5) A machine according to claim 1 which executes a derivation of the Montgomery method as a multiplicity
of serial processes for modular multiplication, squaring, and exponentiation and is sufficiently compact to be
fabricated on a microchip as specified by the ISO 7816 standards for portable Smart Cards using popular 1
micron technology.

6) A machine according to claim 1 which executes a derivation of the Montgomery method as a multiplicity
of serial processes for modular multiplication, squaring, and exponentiation, which can be controlled by any
microprocessor with an internal bus, without changing its basic architecture, and specifically, without
redesigning memories for dual port access, and with relatively small demands of firmware.

7) A machine according to claim 1 using the microcontroller to regulate the 𝒫 field sequences
of squarings and multiplications, wherein the exponent E need not be stored in the MULT block, saving one n
bit long shift register; simplifying the MULT control, while demanding little additional microcontroller ROM
code.

8) A machine according to claim 1, in which loading the Ai register with the squaring multiplicands
"on the fly", while the B register is rotating, precludes unloading by the microcontroller of the previous
final values of B and/or B-N in order to reload the Ai register with Bi characters, so as to conserve
microcontroller RAM and eliminate at least n effective clock cycles on each squaring iteration.

9) A machine according to claim 1, in which two storage registers and separate serial subtraction
operations are deleted from a straightforward implementation of the Montgomery method, accomplished by
enacting a single serial detection on Z/2^k minus N, to determine if Z/2^k is larger than or equal to N, and
subsequently achieving smaller than N operands with only one serial subtraction.

10) A machine according to claim 1, in which the circuit is synchronized in such a quasi-parallel manner
that only two multipliers are used to perform three simultaneous multiplication operations, wherein in a silicon
implementation serial/parallel multipliers can occupy 40% of the silicon area, and using two, instead of three,
serial/parallel multipliers conserves enough silicon to double the number of cells in the remaining two
multipliers; said doubling of the multiplier size reduces the process time for a 512 bit multiplication by more
than 45%.

11) A machine according to claim 1, in which a digital delay circuit, Delay3 (a k bit shift register), is used
to synchronize the serial addition of X with the serial result of the multiplier ML1, Y0·N; precluding double
storage of products or a repetition of a serial/parallel multiplication.

12) A machine according to claim 1, in which two digital delay circuits, Delay1 and Delay2 (two k bit shift
registers), are used to synchronize three serial multiplications in which N is a factor, i.e., B·Ai, X·J0, and Y0·N.

13) A machine according to claim 1, in which a digital delay circuit, Delay3, synchronizes the operation of
the serial/parallel multiplier, ML2, so that it can perform two separate multiplication operations in the process
stream, i.e., X·J0 and Y0·N.

14) A machine according to claim 1, in which the registers S, B, and N can be configured to be either n
bits or n/2 bits long; whereby exponentiation over n/2 length modules can be accomplished in little more than
one eighth of the effective clock cycles that would be necessary for n bit length exponentiations.
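The "little more than one eighth" figure can be checked against the m(n+2k) effective clock cycle count per multiplication that claim 16 states. The short Python calculation below is a rough estimate only: it assumes that an exponentiation with an n bit exponent needs about 1.5·n multiplications and squarings, which is an assumption of this sketch and not a figure from the text, and the function names are hypothetical.

```python
def mult_cycles(n, k):
    """Effective clock cycles for one multiplication: m * (n + 2k), with m = n / k."""
    m = n // k
    return m * (n + 2 * k)


def exponentiation_cycles(n, k):
    """Rough total, assuming about 1.5 * n multiplications/squarings (an assumption)."""
    return (3 * n // 2) * mult_cycles(n, k)


full = exponentiation_cycles(512, 32)
half = exponentiation_cycles(256, 32)
ratio = half / full
```

With k = 32 the ratio is about 0.139 rather than exactly 0.125, because the 2k term in n + 2k does not shrink when n is halved.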

15) A machine according to claim 1, which when processed with an original retrieval factor T, can reduce
the number of 𝒫 field multiplication operations on a full RSA signature exponentiation to close to half.

16) A machine according to claim 1, which, assuming necessary precalculations, can execute the complete
multiplication process 𝒫(A·B)N of an n bit number in only m(n+2k) effective clock cycles, because of "on the fly"
loading of the A register, and "on the fly prediction" of the size of the contents of the S register, and "on the
fly" synchronization of the partial operands.

17) A machine according to claim 1, using the same registers in the same machine as used for Montgomery
multiplications, to which a small borrow detect circuit is appended and a simple addition to the controlling
mechanism, operating in a second mode calculates the H parameter.

18) A machine according to claim 1, wherein every subprocess and process are executed with
predetermined numbers of clock cycles, so that a 𝒫 field multiplication and/or a squaring is performed in known
sequences of clock cycles, enabling the embodiment of a simplified control consisting of a cascade of
self-exciting counting mechanisms with no internal conditional branches.

19) A method for carrying out modular multiplication, wherein the multiplicand A, the multiplier B and the
modulus N comprise m characters of k bits each, the multiplier not being greater than the modulus, which
comprises the steps of:

1. precalculating a parameter H and at least the least significant character J0 of another parameter J, as
hereinbefore defined, and loading J0 into a k-bit register;

2. loading the multiplier B and the modulus N into respective registers of n-bit length, wherein n = m·k;

3. setting an n-bit long register S to zero; and

4. carrying out an i-iteration m times, wherein i is from zero to m-1, each ith iteration comprising the
following operations:

a) transferring the ith character A(i-1) of the multiplicand A from Ai register means to storing means
chosen from among register and latch means;

b) generating the value X = S(i-1) + A(i-1)·B, wherein S(i-1) is the "updated" value of S, as hereinafter
defined, by:

I - cycle right shifting of the B register into multiplying means,

II - serially multiplying B by Ai,

III - cycle right shifting of the modulus N,

IV - determining the "updated" value of S(i-1) as the value stored in the S register after the (i-1)th
iteration, if the same is not greater than N, or if it is greater than N, by serially subtracting N from it and
assuming the resulting value as the "updated" value of S(i-1); and

V - cycle right shifting of the register S and serially adding the value of the multiplication A(i-1)·B bit
by bit to the "updated" value of S;

c) multiplying the LS character of X (X0) by J0 and entering the value X0·J0 mod 2^k into register means
as Y0, while delaying N and X by k clock cycles;

d) calculating the value Z = X + Y0·N by:

I - multiplying Y0 by N by a delayed right shifting of the N register concurrent with the aforesaid
right cycle shifting thereof, and
II - adding X to the value of Y0·N;

e) ignoring the least significant character of Z and entering the remaining characters into the S register,
whereby to enter Z/2^k, except for the last iteration;

f) comparing Z/2^k to N bit by bit for the purpose of determining the updated value of S(i-1) in the manner
hereinbefore defined;

g) wherein the ith character of the multiplicand Ai is loaded into the Ai register means at any time during
the aforesaid operations;

5. at the last (mth) iteration, ignoring the least significant character of Z/2^k and entering the remaining
characters into the B register, as the value of C ¥ 𝒫(A·B)N;

6. repeating the steps 3) to 4), wherein C or C-N, if C is greater than N, is substituted for B and H is
substituted for A, whereby to calculate P = 𝒫(C·H)N; and

7. assuming the value of P obtained from the last iteration as the result of the operation A·B mod N.

20) A method according to claim 19, wherein n is chosen from among 256 and 512 or increments of
multiples of k, and k is 32.

21) A method for performing modular squaring, comprising carrying out modular multiplication according
to claim 19, wherein multiplicand and multiplier are the same number.

22) A method for performing modular exponentiation D = A^E mod N, comprising carrying out multiplications
and squarings by the method of claim 19.

23) A method according to claim 22, comprising the steps of:

1. loading the modulus into the register N;

2. setting the register S to zero;

3. loading the base A to be exponentiated into the register B;

4. storing the exponent E in a computer register;

5. shifting said exponent E left;

6. ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 7 to 8:

7. for every one of said bits, regardless of its being 0 or 1, squaring the content of register B by the
multiplication method hereinbefore set forth, wherein the successive characters of the base are loaded
into register Ai from register B;

8. if and only if the current bit of the exponent E is 1, multiplying, after performing operation 7), the
content of register B by the base A; and

9. after performing steps 6-8 for all bits of E, storing the result of the last operation as D ¥ A^E mod N
in register B.
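The bit scanning in steps 5 to 9 is the familiar left-to-right square-and-multiply sequence. The Python sketch below illustrates only that control flow: the `modmul` argument stands in for the multiplication method of claim 19 and defaults to ordinary integer arithmetic, which is an assumption of this example, as are the function and parameter names.

```python
def mod_exp(a, e, n, modmul=lambda x, y, n: (x * y) % n):
    """Left-to-right exponentiation, skipping leading zeros and the first 1 bit."""
    assert e > 0
    bits = bin(e)[2:]          # exponent bits, MS bit first; bits[0] is the first 1
    b = a % n                  # step 3: the base is loaded into register B
    for bit in bits[1:]:       # step 6: the first 1 bit is ignored
        b = modmul(b, b, n)    # step 7: square for every bit
        if bit == "1":
            b = modmul(b, a, n)    # step 8: multiply by the base only for 1 bits
    return b
```

For instance, `mod_exp(7, 13, 179)` returns the same value as Python's built-in `pow(7, 13, 179)`. Because the exponent is scanned by the caller, it never has to be stored alongside the multiplier itself.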

24) A method for performing modular exponentiation D = A^E mod N, by performing iterations according to
claim 19, comprising the steps of:

1. loading the modulus into the register N;

2. setting the register S to zero;

3. loading the base A to be exponentiated into the register B;

4. storing the exponent E in a computer register, and a precalculated parameter T, as hereinbefore
defined, in the CPU memory;

5. shifting said exponent E left;

6. ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 7 to 8:

7. for every one of said bits, regardless of its being 0 or 1, carrying out operations 4 and 5 of the
multiplication method of claim 1, wherein both the multiplicand and the multiplier are the base A, and
wherein the successive characters of the base are loaded into register Ai from register B;

8. if and only if the current bit of the exponent E is 1, carrying out, after performing operation 7),
operations 4) and 5) of the multiplication method of claim 1, wherein the multiplicand is the content of
register B and the multiplier is the base A; and

9. after performing steps 7-8 for all bits of E, performing an additional multiplication of the content of
register B by the parameter T, and storing the result of the last operation as D ¥ A^E mod N in register
B.

25) A device for carrying out modular multiplication by the method of claim 19, comprising control means
including a CPU and a multiplication circuit which comprises:

an n-bit shift register B for the multiplier;
an n-bit shift register N for the modulus;

an n-bit shift register for the value S as herein defined;

a k-bit register Ai for the multiplicand;

k-bit register means for the values J0 and Y0 as herein defined;

multiplier means for multiplying the content of the B register by that of the Ai register,
additional n-bit multiplier means; and

adding, subtracting, multiplexing and delay means.

26) A device according to claim 25, wherein all connections between the n-bit registers and the remaining
components and between components none of which is a latch, are 1-bit connections.

27) A method for performing modular exponentiation of D = A^E mod N, which comprises the steps of:
1. storing the exponent E in a computer register;

2. loading the modulus into the aforesaid register N;

3. setting the aforesaid register S to zero;

4. performing a multiplication operation, by the method of application No. 104753, of A* = 𝒫(A·H)N,
while A is the operand to be exponentiated, and H is a precalculated parameter as defined before;

5. loading A* into the base register B;

6. performing a squaring operation of the contents of register B;

7. shifting said exponent E left;

8. ignoring all the zero bits thereof which precede the first 1 bit and ignoring the first 1 bit of said
exponent E, and for all the following bits performing the operations 9 to 10;

9. for every one of said E bits, regardless of their being 0 or 1, carrying out operations 4 and 5 of the
squaring method hereinbefore set forth, wherein both the multiplicand and the multiplier originate from
the B register, and wherein the successive characters of the Montgomery multiplier are loaded into
register Ai from register B;

10. if and only if the current bit of the exponent E is 1, carrying out, after performing operation 9,
operations 4 and 5 of the multiplication method hereinbefore set forth, wherein the multiplicand is the
content of register B and the multiplier is the base A*; and

11. after performing steps 8-10 for all bits of E, performing an additional multiplication of register B by
the original base A and then storing the result of the last operation as D ¥ A^E mod N in register B.

28) A method for carrying out conventional multiplication of two numbers whose average significant length
is n/2 bits, comprising carrying out modular multiplication of said numbers by the multiplication process of
Claim 19, wherein the modulus, N, is an n-bit number consisting of all "1"s (fffffff....fff), equating J0 to 1, and
loading the multiplicand in B and manipulating A as in said multiplication process of Claim 1; N can be all ones
either by means of preloading register N with all ones or by setting the multiplexer which outputs N to output
a series of "hard" ones.
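Why an all-ones modulus yields an ordinary product can be seen numerically. With N = 2^n - 1 the LS character of N is 2^k - 1, so J0 = 1 satisfies N0·J0 ≡ -1 mod 2^k, and since 2^n ≡ 1 mod N the factor 2^(-n) left by the iterations disappears. The sketch below is an arithmetic illustration with hypothetical function names and a small n = 16, k = 4 configuration chosen only for the example.

```python
def montgomery_multiply(a, b, n, j0, k, m):
    """m iterations of S = (S + Ai*B + Y0*N) / 2^k; result may exceed N by one N."""
    mask = (1 << k) - 1
    s = 0
    for i in range(m):
        if s >= n:                         # reduce the previous S(i) if needed
            s -= n
        a_i = (a >> (k * i)) & mask
        x = s + a_i * b
        y0 = ((x & mask) * j0) & mask
        s = (x + y0 * n) >> k
    return s


def plain_multiply(a, b, k, m):
    """Conventional product a*b, valid while a*b < 2^(k*m) - 1."""
    n = (1 << (k * m)) - 1                 # modulus of all ones
    s = montgomery_multiply(a, b, n, 1, k, m)   # J0 is simply 1
    return s - n if s >= n else s
```

For example, `plain_multiply(200, 150, 4, 4)` gives 30000, the ordinary product, because 30000 is smaller than 2^16 - 1.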

29) A method according to claim 27 & 28, carried out by means of the device of any one of the preceding
claims.

30) A method for performing modular exponentiation of D = A E mod N, substantially as described. 

31) A method for carrying out conventional multiplication of n/2-bit numbers, substantially as described. 
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